PolicyLens was my attempt to turn opaque insurance PDFs into structured truth.
It started as a practical frustration with insurance-buying UX and ended as a document compiler for regulated PDFs: one that could parse structure, link facts to evidence, and still tell me where the product thesis broke.
Origin
The project existed because the buying experience felt structurally broken.
The original problem was not academic. While buying health insurance for myself and my family, I kept running into the same pattern: aggregators generated spam calls, hid important products behind unclear filters, and often surfaced recommendations that did not match what their own UI claimed to support.
I wanted one screen where a user could describe their situation and get a short, source-backed answer. No call-center theater. No forced opacity. No pretending that policy comparison is obvious when the underlying documents are dense legal PDFs.
Buyer frustration
Opaque aggregator flows, hidden products, and sales pressure created the original problem statement.
Naive LLM pass
Whole-PDF prompting and markdown conversion both plateaued. The output looked fluent but was not comparison-grade.
Compiler pivot
The problem changed from “ask a model” to “compile a legal document into a structured, evidence-linked representation.”
Prototype truth gap
The local prototype showed that a correctly parsed policy can still be a misleading product recommendation.
Closeout
The engine worked. The broad consumer-product path did not. The repo became a finished open-source artifact.
False Start
The first approaches failed for a simple reason: they never became trustworthy.
The obvious first idea was to let a general model read policy PDFs directly. I tried that. Then I tried converting PDFs into markdown and plain text first, hoping cleaner input would improve the result. It did not.
The shape of the failure was revealing. The model could summarize. It could sound competent. But the output was still brittle for the thing I actually needed: structured, comparison-grade facts with reliable grounding in source text.
Approach: Direct PDFs
Fastest path to try, weakest path to trust.
- High token pressure on long legal documents
- Weak structure fidelity on clauses and tables
- Difficult to guarantee typed output quality
Good for a demo. Bad for a document compiler.
Approach: Markdown conversion
Cleaner input, same ceiling.
- Text became easier to inspect
- Tables still lost operational structure
- Model behavior was materially similar to native PDF prompting
Changing format did not change the bottleneck.
Approach: Compiler pipeline
More work up front, but finally testable.
- Separate physical, logical, and semantic layers
- Store evidence and coordinates
- Extract facts with explicit status instead of vibes
This is where the project became real engineering.
The bottleneck was not whether the model saw a PDF, markdown, or plain text. The bottleneck was that I still did not have a structured representation of the document.
Mental Model Shift
The real shift was understanding that a policy PDF is not just text.
Once I stopped thinking about the problem as “ask an AI about a PDF,” the architecture got clearer. A policy document is layered. It has layout, headings, clauses, tables, and product-specific conditions. Flattening all of that into raw text throws away exactly the context that makes the document legally meaningful.
What is a clause?
A clause is a self-contained policy statement with legal effect: the unit that says what is covered, excluded, conditioned, or time-bound.
What is a section tree?
It is the hierarchy of the document: headings, subheadings, sections, and the clauses that live under them. Without it, the parser loses context.
What is source truth?
It means every extracted fact stays traceable to the exact text or table row that supports it, instead of becoming detached summary.
Clause tree explainer
Architecture view
Physical layer
Pages, blocks, lines, coordinates
Logical layer
Headings, sections, clauses, table parents
Semantic layer
Facts, statuses, evidence, normalized values
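To make the three layers concrete, here is a minimal sketch of how they might be modeled. Every class and field name below is hypothetical and chosen for illustration, not taken from the PolicyLens repo, and the sample clause text is made up. The point it shows is that a semantic fact keeps a pointer to its physical line and to the heading trail from the logical layer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names here are illustrative assumptions, not the project's real schema.

@dataclass
class Line:
    """Physical layer: one line of text with its page and bounding box."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units
    text: str

@dataclass
class Section:
    """Logical layer: a heading plus the lines and subsections under it."""
    heading: str
    lines: List[Line] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)

@dataclass
class Fact:
    """Semantic layer: a normalized value that points back at its evidence."""
    concept: str
    value: Optional[str]
    status: str
    evidence: Line
    section_path: List[str]

def section_path(root: Section, target: Line,
                 trail: Optional[List[str]] = None) -> Optional[List[str]]:
    """Return the heading trail down to the section that holds `target`."""
    trail = (trail or []) + [root.heading]
    if any(line is target for line in root.lines):
        return trail
    for child in root.children:
        found = section_path(child, target, trail)
        if found:
            return found
    return None

# Made-up sample data for illustration only.
line = Line(page=12, bbox=(72.0, 410.5, 523.0, 422.0),
            text="Pre-existing diseases are covered after 36 months.")
doc = Section("Policy", children=[
    Section("Exclusions", children=[Section("Waiting periods", lines=[line])]),
])
fact = Fact("pre_existing_waiting_period", "36 months", "found",
            line, section_path(doc, line))
```

Flattening the PDF to raw text would keep `line.text` and lose both `bbox` and `section_path`, which is exactly the context the layered model is meant to preserve.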
Pipeline
PolicyLens became a layered document compiler.
The pipeline eventually settled into eight stages. Each stage had a different job, a different failure mode, and a different way to measure whether it was working.
- Stage 1: Corpus lockdown
- Stage 2: UIN reconciliation
- Stage 3: Physical layer
- Stage 4: Logical layer
- Stage 5: Table engine
- Stage 6: Clause store
- Stage 7: Fact extraction
- Stage 8: Compiled export
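The eight stages can be sketched as an ordered list of functions that each take and return a shared state. This is only an illustration of the shape: the stage bodies are placeholders, and `make_stage`, `run_pipeline`, and the state dictionary are hypothetical names rather than the repo's actual interfaces.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

# Stage names follow the eight stages listed above.
STAGE_NAMES: List[str] = [
    "corpus_lockdown",
    "uin_reconciliation",
    "physical_layer",
    "logical_layer",
    "table_engine",
    "clause_store",
    "fact_extraction",
    "compiled_export",
]

def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def run(state: Dict) -> Dict:
        # A real stage would do its work here and attach its own outputs.
        return {**state, "completed": state.get("completed", []) + [name]}
    run.__name__ = name
    return run

PIPELINE: List[Stage] = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(state: Dict, stages: List[Stage] = PIPELINE) -> Dict:
    """Run every stage in order, threading the state through."""
    for stage in stages:
        state = stage(state)
    return state

result = run_pipeline({"source": "policy.pdf"})
```

Keeping stages as separate callables is what lets each one have its own failure mode and its own evaluation, since any prefix of the list can be run and inspected on its own.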
Upstream identity work
The engine first had to know what each file actually was. That meant corpus lockdown, document filtering, insurer-plan normalization, and UIN (Unique Identification Number) reconciliation before parsing anything.
Downstream fact work
After structure and tables came clause storage, source spans, and deterministic extraction for concrete concepts like waiting periods, co-pay, room rent limits, and deductibles.
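As a rough illustration of what deterministic extraction with explicit status can look like, the sketch below pulls a waiting period out of a single clause with a regular expression. The pattern, the status names (`found`, `ambiguous`, `not_found`), and the result fields are assumptions made for this example, not the project's actual extractor.

```python
import re
from typing import Any, Dict

# Matches durations such as "30 days", "24 months", "2 years".
DURATION = re.compile(r"(\d+)\s*(day|month|year)s?\b", re.IGNORECASE)

def extract_waiting_period(clause_text: str) -> Dict[str, Any]:
    """Return a waiting-period fact with an explicit status and source span."""
    result: Dict[str, Any] = {
        "concept": "waiting_period",
        "status": "not_found",
        "value": None,
        "unit": None,
        "span": None,
    }
    if "waiting period" not in clause_text.lower():
        return result
    matches = list(DURATION.finditer(clause_text))
    if not matches:
        return result
    if len(matches) > 1:
        # Several durations in one clause: refuse to pick one silently.
        result["status"] = "ambiguous"
        return result
    match = matches[0]
    result.update(
        status="found",
        value=int(match.group(1)),
        unit=match.group(2).lower(),
        span=match.span(),  # character offsets back into the clause text
    )
    return result

# Made-up clause text for illustration.
clause = "A waiting period of 30 days applies from the policy start date."
fact = extract_waiting_period(clause)
```

The same input always yields the same output, and an unclear clause produces `ambiguous` rather than a confident guess, which is the property that makes this kind of extractor testable.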
Evidence chain
Raw text
A line or table row inside the original policy PDF.
Clause
The text is attached to a document section and given context.
Extracted fact
A concept is emitted with value, status, confidence, and evidence.
Compiled output
The result becomes structured JSON instead of a loose answer.
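The last step of that chain, turning an extracted fact into structured JSON, might look like the sketch below. The field names mirror the value, status, confidence, and evidence attributes named above, but the exact schema, the `compile_fact` helper, and the sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Evidence:
    page: int
    section: str
    text: str

@dataclass
class ExtractedFact:
    concept: str
    value: str
    status: str
    confidence: float
    evidence: Evidence

def compile_fact(fact: ExtractedFact) -> str:
    """Serialize a fact, including its nested evidence, to stable JSON."""
    return json.dumps(asdict(fact), indent=2, sort_keys=True)

# Made-up sample values for illustration.
fact = ExtractedFact(
    concept="room_rent_limit",
    value="1% of sum insured per day",
    status="found",
    confidence=0.9,
    evidence=Evidence(
        page=7,
        section="Benefits > Room rent",
        text="Room rent is limited to 1% of the sum insured per day.",
    ),
)
compiled = compile_fact(fact)
```

Because the evidence travels inside the same record as the value, a downstream consumer can always show the supporting text next to the number instead of a detached summary.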
Outputs
The project did not just produce ideas. It produced measurable artifacts.
By the end, the repo had moved far past the “interesting prototype” stage. It had a real gold corpus, a real parser stack, a real table engine, deterministic extractors for the priority concepts, and a documented closeout state with a full passing test suite.
- 20 reviewed gold policies: The manually reviewed benchmark corpus used to keep the pipeline honest.
- 647 active policy wordings: The full policy-wording corpus processed after document filtering and UIN reconciliation.
- 20/20 priority concepts active: Deterministic extraction for the core comparison features actually turned on.
- 490/490 tests passing: The final closeout state of the test suite before the project was formally stopped.
Draft source-bundle registry
507 draft bundles were generated. 504 were still `missing_pbt` (PBT: Product Benefit Table), 1 was `acceptable_with_known_gap`, and 2 were `rejected`.
Curated MVP corpus
30 latest-reviewed products survived the verification gate, backed by 45 current official source documents downloaded and hashed.
Curated bundle quality
The 30-product set still had visible gaps: 18 `acceptable_with_known_gap`, 6 `missing_cis` (CIS: Customer Information Sheet), 4 `missing_pbt` (PBT: Product Benefit Table), and 2 `stale_version`.
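The status counts reported for the draft registry and the curated set can be checked and summarized with a few lines. The counts come from the figures above; the `summarize` helper is a hypothetical name used only for this sketch.

```python
from collections import Counter
from typing import Dict

draft = Counter({"missing_pbt": 504, "acceptable_with_known_gap": 1, "rejected": 2})
curated = Counter({
    "acceptable_with_known_gap": 18,
    "missing_cis": 6,
    "missing_pbt": 4,
    "stale_version": 2,
})

def summarize(counts: Counter) -> Dict[str, float]:
    """Return each status as a percentage of the total, largest first."""
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in counts.most_common()}

draft_share = summarize(draft)
curated_share = summarize(curated)
```

Both tallies add up to their stated totals (507 and 30), and in the curated set the largest single status is still a known gap rather than a clean pass.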
Table engine result
The table eval eventually passed on the reviewed corpus: 387/387 labels detected, with 100% priority recall, 100% type accuracy, and 100% header lineage.
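For readers who want to see how metrics like these could be computed, here is a minimal sketch of a table evaluation with a hard pass/fail gate. The metric definitions, the label format, and the function names are assumptions for illustration; the source does not spell out how the project defined them, and header lineage is left out of the sketch.

```python
from typing import Dict, List

def table_eval(gold: List[Dict], predicted: Dict[str, Dict]) -> Dict[str, float]:
    """Compare predicted tables to reviewed labels, keyed by label id."""
    detected = [g for g in gold if g["id"] in predicted]
    priority = [g for g in gold if g["priority"]]
    priority_hit = [g for g in priority if g["id"] in predicted]
    type_ok = [g for g in detected if predicted[g["id"]]["type"] == g["type"]]
    return {
        "detected": len(detected),
        "labels": len(gold),
        "priority_recall": len(priority_hit) / len(priority) if priority else 1.0,
        "type_accuracy": len(type_ok) / len(detected) if detected else 0.0,
    }

def passes_gate(report: Dict[str, float]) -> bool:
    """Hard gate: every label detected, with perfect recall and type accuracy."""
    return (
        report["detected"] == report["labels"]
        and report["priority_recall"] == 1.0
        and report["type_accuracy"] == 1.0
    )

# Tiny made-up label set for illustration.
gold = [
    {"id": "t1", "type": "benefit", "priority": True},
    {"id": "t2", "type": "schedule", "priority": False},
]
predicted = {"t1": {"type": "benefit"}, "t2": {"type": "schedule"}}
report = table_eval(gold, predicted)
```

A gate of this kind fails as soon as a single reviewed label is missed, so a pass means the reviewed corpus was fully covered rather than mostly covered.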
Those numbers matter because they separate “I tried some prompts” from “I built an evaluated document system.” The system knew how many files it had, which documents were in play, how tables were performing, and where product-truth gaps still existed.
Hard Lesson
The downstream prototype exposed the difference between parsing truth and recommendation truth.
This was the project’s most important reality check. The parser could extract a value that was locally defensible, and the product could still be wrong in the way that matters to a buyer.
The clearest example was co-pay. In one case, the local prototype showed a single co-pay value with strong confidence. But the real product truth depended on the variant. The policy wording alone was not enough. Variant-specific numbers lived in adjacent official documents like Product Benefit Tables and Customer Information Sheets.
Recommendation risk
Parsed correctly, still misleading
A single extracted number can look clean in a UI while still hiding variant-level conditions that matter to the user. That is a product-truth failure, not just a parser bug.
Design response
Source bundles, not single PDFs
The repo had to evolve from “one product equals one wording PDF” to “one product equals a bundle of official documents and version checks.”
Source-bundle explainer
Policy wording
PBT / benefit table
CIS
Brochure
Version drift check
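The co-pay problem suggests a lookup that refuses to present a wording-level value as variant truth. The sketch below is one way to express that, under stated assumptions: the precedence rule (benefit table over policy wording), the `variant_confirmed` and `unknown_variant` statuses, and all field names are hypothetical, and the percentages are invented sample data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SourceBundle:
    """One product modeled as several official documents, not one PDF."""
    product: str
    wording_copay: Optional[str] = None  # value parsed from the policy wording
    pbt_copay_by_variant: Dict[str, str] = field(default_factory=dict)

def resolve_copay(bundle: SourceBundle, variant: str) -> Dict[str, Optional[str]]:
    """Resolve co-pay for a variant and say which document backs the answer."""
    if variant in bundle.pbt_copay_by_variant:
        return {
            "value": bundle.pbt_copay_by_variant[variant],
            "source": "pbt",
            "status": "variant_confirmed",
        }
    if not bundle.pbt_copay_by_variant:
        # Only the wording is available: surface the value, but flag the gap.
        return {
            "value": bundle.wording_copay,
            "source": "policy_wording",
            "status": "missing_pbt",
        }
    # A benefit table exists but does not list this variant: do not guess.
    return {"value": None, "source": None, "status": "unknown_variant"}

# Made-up sample data for illustration.
bundle = SourceBundle("Example Plan", "10%", {"Silver": "20%", "Gold": "0%"})
wording_only = SourceBundle("Example Plan", "10%")
```

With only the wording, both variants would show 10% in a UI. With the bundle, `resolve_copay(bundle, "Gold")` and `resolve_copay(bundle, "Silver")` return different values, and the wording-only case carries a status that tells the caller the number is not variant-verified.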
Closeout
The project stopped because the product problem turned out to be operational, not just technical.
The closeout was not a dramatic collapse. It was a sober decision. The engine was real. The extraction work was valuable. But building a broad, continuously current consumer comparison product from public insurer PDFs was not a sane side-project operating model.
Once you care about trustworthy recommendations, you are no longer just building a parser. You are maintaining product catalogs, cross-document precedence, live version drift, and source freshness. That is ongoing data operations, not just model quality.
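A small part of that operational burden, detecting that a published document has changed since it was last reviewed, can be sketched with content hashes. SHA-256 and the manifest layout are assumptions here; the source only says that documents were downloaded and hashed.

```python
import hashlib
from typing import Dict

def fingerprint(document_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(document_bytes).hexdigest()

def check_drift(manifest: Dict[str, str], fetched: Dict[str, bytes]) -> Dict[str, str]:
    """Compare freshly fetched documents against the recorded hashes."""
    report = {}
    for name, recorded_hash in manifest.items():
        if name not in fetched:
            report[name] = "missing"
        elif fingerprint(fetched[name]) != recorded_hash:
            report[name] = "drifted"
        else:
            report[name] = "current"
    return report

# Stand-in bytes instead of real PDFs, for illustration only.
manifest = {"wording.pdf": fingerprint(b"v1"), "pbt.pdf": fingerprint(b"v1")}
fetched = {"wording.pdf": b"v1", "pbt.pdf": b"v2"}
drift = check_drift(manifest, fetched)
```

The check itself is trivial. The cost lies in running it continuously for every insurer and document type, then re-reviewing whatever comes back as `drifted`.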
The engine worked
The parser, table stack, and extractor pipeline became robust enough to generate structured outputs with evidence.
The market surface was unstable
Public insurer documents were fragmented, versioned unevenly, and often split across wording, tables, CIS documents, and brochures.
The honest ending was to stop
Forcing the broad product thesis would have created a trust problem disguised as product progress.
What Survived
What remains valuable is the architecture, the discipline, and the record.
Even though the broad consumer product stopped, the repo still holds several things worth reusing. It is a practical reference for building document pipelines where evidence matters and wrong answers are more dangerous than missing ones.
- Document-intelligence architecture: separate physical, logical, and semantic layers instead of treating a PDF as one blob of text.
- Evaluation discipline: use reviewed corpora and hard gates so the system cannot quietly flatter itself.
- Evidence-linked extraction: every user-facing fact should stay tied to the source that justifies it.
- Source-bundle modeling: product truth often lives across multiple official documents, not one canonical file.
That is the part I would carry forward into another domain. Not the fantasy of “AI can read everything,” but the more defensible claim that complex documents can be compiled into explicit, inspectable structures.
Appendix
Project record
Everything above is traceable to the project’s own repository record: the README, the evaluation docs, the task ledger, the bundle closeout, and the final project closeout. That paper trail is part of the work.