PolicyLens was my attempt to turn opaque insurance PDFs into structured truth.

It started as a practical frustration with insurance-buying UX and ended as a document compiler for regulated PDFs: one that could parse structure, link facts to evidence, and still tell me where the product thesis broke.

2026-06-08 · technical story · 12 min read · open-source repo

Origin

The project existed because the buying experience felt structurally broken.

The original problem was not academic. While buying health insurance for myself and my family, I kept running into the same pattern: aggregators generated spam calls, hid important products behind unclear filters, and often surfaced recommendations that did not line up with what their own UI claimed to support.

I wanted one screen where a user could describe their situation and get a short, source-backed answer. No call-center theater. No forced opacity. No pretending that policy comparison is obvious when the underlying documents are dense legal PDFs.

01 · Buyer frustration

Opaque aggregator flows, hidden products, and sales pressure created the original problem statement.

02 · Naive LLM pass

Whole-PDF prompting and markdown conversion both plateaued. The output looked fluent but was not comparison-grade.

03 · Compiler pivot

The problem changed from “ask a model” to “compile a legal document into a structured, evidence-linked representation.”

04 · Prototype truth gap

The local prototype showed that a correctly parsed policy can still be a misleading product recommendation.

05 · Closeout

The engine worked. The broad consumer-product path did not. The repo became a finished open-source artifact.

False Start

The first approaches failed for a simple reason: they never became trustworthy.

The obvious first idea was to let a general model read policy PDFs directly. I tried that. Then I tried converting PDFs into markdown and plain text first, hoping cleaner input would improve the result. It did not.

The shape of the failure was revealing. The model could summarize. It could sound competent. But the output was still brittle for the thing I actually needed: structured, comparison-grade facts with reliable grounding in source text.

approach

Direct PDFs

Fastest path to try, weakest path to trust.

  • High token pressure on long legal documents
  • Weak structure fidelity on clauses and tables
  • Difficult to guarantee typed output quality

Good for a demo. Bad for a document compiler.

approach

Markdown conversion

Cleaner input, same ceiling.

  • Text became easier to inspect
  • Tables still lost operational structure
  • Model behavior was materially similar to native PDF prompting

Changing format did not change the bottleneck.

approach

Compiler pipeline

More work up front, but finally testable.

  • Separate physical, logical, and semantic layers
  • Store evidence and coordinates
  • Extract facts with explicit status instead of vibes

This is where the project became real engineering.

The bottleneck was not whether the model saw a PDF, markdown, or plain text. The bottleneck was that I still did not have a structured representation of the document.

Mental Model Shift

The real shift was understanding that a policy PDF is not just text.

Once I stopped thinking about the problem as “ask an AI about a PDF,” the architecture got clearer. A policy document is layered. It has layout, headings, clauses, tables, and product-specific conditions. Flattening all of that into raw text throws away exactly the context that makes the document legally meaningful.

What is a clause?

A clause is a self-contained policy statement with legal effect: the unit that says what is covered, excluded, conditioned, or time-bound.

What is a section tree?

It is the hierarchy of the document: headings, subheadings, sections, and the clauses that live under them. Without it, the parser loses context.

What is source truth?

It means every extracted fact stays traceable to the exact text or table row that supports it, instead of becoming a detached summary.

Clause tree explainer

Policy wording
4. Waiting Periods
4.1 Pre-existing disease waiting period
Clause: coverage begins after 36 months, unless noted otherwise
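
One way to picture the section tree is as a small parser over numbered headings. The sketch below is illustrative only, not the repo's implementation: the `Section` class, the `build_tree` function, and the heading pattern are assumptions, and a real policy would need more robust heading detection (a clause that happens to start with a number would be misread as a heading here).

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed heading shape: "4. Title" or "4.1 Title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$")

@dataclass
class Section:
    number: str
    title: str
    clauses: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)

def build_tree(lines: List[str]) -> Section:
    root = Section(number="", title="ROOT")
    stack = [root]  # stack[i] is the currently open section at depth i
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            depth = number.count(".") + 1
            del stack[depth:]  # close sections at the same or deeper level
            node = Section(number, title)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Non-heading text becomes a clause of the innermost open section.
            stack[-1].clauses.append(line.removeprefix("Clause:").strip())
    return root

wording = [
    "4. Waiting Periods",
    "4.1 Pre-existing disease waiting period",
    "Clause: coverage begins after 36 months, unless noted otherwise",
]
tree = build_tree(wording)
sub = tree.children[0].children[0]
print(sub.number, sub.title, sub.clauses)
```

The clause keeps its parents: anything extracted from it can be reported as belonging to section 4.1 under "Waiting Periods" rather than as a free-floating sentence.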

Architecture view

Physical layer

Pages, blocks, lines, coordinates

Logical layer

Headings, sections, clauses, table parents

Semantic layer

Facts, statuses, evidence, normalized values
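
The three layers can be modeled as plain data types that point at each other. This is a minimal sketch rather than the repo's actual schema: the class and field names (`Block`, `Clause`, `Fact`, `bbox`, `section_path`), the concept name, and the sample page and coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Block:
    """Physical layer: a run of text with its position on a page."""
    page: int
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 in page units
    text: str

@dataclass(frozen=True)
class Clause:
    """Logical layer: blocks grouped under a path of headings."""
    section_path: Tuple[str, ...]
    blocks: Tuple[Block, ...]

    @property
    def text(self) -> str:
        return " ".join(block.text for block in self.blocks)

@dataclass(frozen=True)
class Fact:
    """Semantic layer: a normalized value that still points at its clause."""
    concept: str
    value: Optional[int]
    unit: Optional[str]
    status: str
    evidence: Clause

# Hypothetical page number and coordinates, for illustration only.
block = Block(page=12, bbox=(72.0, 410.5, 523.0, 424.0),
              text="coverage begins after 36 months, unless noted otherwise")
clause = Clause(section_path=("4. Waiting Periods",
                              "4.1 Pre-existing disease waiting period"),
                blocks=(block,))
fact = Fact(concept="ped_waiting_period", value=36, unit="months",
            status="found", evidence=clause)

# Walking back from the fact reaches the page and coordinates.
print(fact.evidence.blocks[0].page, fact.evidence.blocks[0].bbox)
```

The point of the design is the direction of the references: a fact cannot exist without a clause, and a clause cannot exist without positioned blocks.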

Pipeline

PolicyLens became a layered document compiler.

The pipeline eventually settled into eight stages. Each stage had a different job, a different failure mode, and a different way to measure whether it was working.

  • Stage 1: Corpus lockdown
  • Stage 2: UIN reconciliation
  • Stage 3: Physical layer
  • Stage 4: Logical layer
  • Stage 5: Table engine
  • Stage 6: Clause store
  • Stage 7: Fact extraction
  • Stage 8: Compiled export

Upstream identity work

The engine first had to know what each file actually was. That meant corpus lockdown, document filtering, insurer-plan normalization, and UIN reconciliation before parsing anything.

Downstream fact work

After structure and tables came clause storage, source spans, and deterministic extraction for concrete concepts like waiting periods, co-pay, room rent limits, and deductibles.
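
To make "deterministic extraction" concrete, here is a hedged sketch of what a rule-based extractor for one concept might look like. It is not the repo's extractor: the function name, the concept label `ped_waiting_period`, the status values (`found`, `ambiguous`, `not_found`), and the regular expression are assumptions chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Matches durations such as "36 months" or "3 years".
PERIOD = re.compile(r"(\d+)\s*(month|year)s?", re.IGNORECASE)

@dataclass
class Extraction:
    concept: str
    status: str                      # "found", "ambiguous", or "not_found"
    value_months: Optional[int]
    span: Optional[Tuple[int, int]]  # character offsets into the clause text

def extract_waiting_period(clause_text: str) -> Extraction:
    hits = []
    for match in PERIOD.finditer(clause_text):
        amount = int(match.group(1))
        months = amount * 12 if match.group(2).lower() == "year" else amount
        hits.append((months, match.span()))
    if not hits:
        return Extraction("ped_waiting_period", "not_found", None, None)
    if len({months for months, _ in hits}) > 1:
        # Conflicting durations: report the conflict instead of guessing.
        return Extraction("ped_waiting_period", "ambiguous", None, None)
    months, span = hits[0]
    return Extraction("ped_waiting_period", "found", months, span)

text = "coverage begins after 36 months, unless noted otherwise"
result = extract_waiting_period(text)
print(result.status, result.value_months, text[result.span[0]:result.span[1]])
```

The same input always yields the same output, and the returned span is the source evidence. When a clause mentions two different durations, the sketch emits `ambiguous` rather than picking one.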

Evidence chain

Raw text

A line or table row inside the original policy PDF.

Clause

The text is attached to a document section and given context.

Extracted fact

A concept is emitted with value, status, confidence, and evidence.

Compiled output

The result becomes structured JSON instead of a loose answer.
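
A compiled record at the end of this chain might look roughly like the following. The field names, the confidence number, the page, and the rule that a `found` fact must carry a quote are assumptions for illustration, not the repo's export schema.

```python
import json

def compile_fact(concept, value, unit, status, confidence, evidence):
    """Build one export record; refuse a 'found' fact that has no quote."""
    if status == "found" and not evidence.get("quote"):
        raise ValueError(f"{concept}: a found fact needs an evidence quote")
    return {
        "concept": concept,
        "value": value,
        "unit": unit,
        "status": status,
        "confidence": confidence,
        "evidence": evidence,
    }

record = compile_fact(
    concept="ped_waiting_period",
    value=36,
    unit="months",
    status="found",
    confidence=0.95,  # illustrative number, not a measured score
    evidence={
        "section": "4.1 Pre-existing disease waiting period",
        "page": 12,  # hypothetical page
        "quote": "coverage begins after 36 months, unless noted otherwise",
    },
)
compiled = json.dumps(record, indent=2)
print(compiled)
```

Rejecting evidence-free facts at export time is one way to keep a loose answer from slipping into structured output.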

Outputs

The project did not just produce ideas. It produced measurable artifacts.

By the end, the repo had moved far past the “interesting prototype” stage. It had a manually reviewed gold corpus, a working parser stack, a table engine, deterministic extractors for the priority concepts, and a documented closeout state with a fully passing test suite.

20

reviewed gold policies

The manually reviewed benchmark corpus used to keep the pipeline honest.

647

active policy wordings

The full policy-wording corpus processed after document filtering and UIN reconciliation.

20/20

priority concepts active

Deterministic extraction for the core comparison features actually turned on.

490/490

tests passing

The final closeout state of the test suite before the project was formally stopped.

Draft source-bundle registry

507 draft bundles were generated. Of these, 504 were still `missing_pbt` (no Product Benefit Table attached), 1 was `acceptable_with_known_gap`, and 2 were `rejected`.

Curated MVP corpus

30 latest-reviewed products survived the verification gate, backed by 45 current official source documents downloaded and hashed.

Curated bundle quality

The 30-product set still had visible gaps: 18 `acceptable_with_known_gap`, 6 `missing_cis` (no Customer Information Sheet), 4 `missing_pbt` (no Product Benefit Table), and 2 `stale_version`.

Table engine result

The table eval eventually passed on the reviewed corpus: 387/387 labels detected, with 100% priority recall, 100% type accuracy, and 100% header lineage.
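
For readers unfamiliar with this kind of gate, the sketch below shows how recall and type accuracy could be computed against reviewed labels and turned into a pass/fail check. The metric definitions, function names, table ids, and table types are assumptions, the data is a toy example rather than the project's results, and header lineage is not modeled.

```python
from typing import Dict

def table_eval(gold: Dict[str, str], predicted: Dict[str, str]) -> Dict[str, float]:
    """Both arguments map a table id to its table type."""
    detected = [tid for tid in gold if tid in predicted]
    correct_type = [tid for tid in detected if predicted[tid] == gold[tid]]
    return {
        "recall": len(detected) / len(gold) if gold else 0.0,
        "type_accuracy": len(correct_type) / len(detected) if detected else 0.0,
    }

def gate(metrics: Dict[str, float], threshold: float = 1.0) -> bool:
    """Hard gate: every metric must reach the threshold."""
    return all(score >= threshold for score in metrics.values())

# Toy labels, not real evaluation data.
gold = {"t1": "benefit", "t2": "waiting_period", "t3": "benefit", "t4": "sub_limit"}
predicted = {"t1": "benefit", "t2": "benefit", "t3": "benefit"}

metrics = table_eval(gold, predicted)
print(metrics, gate(metrics))
```

In this toy run one table is missed and one is mistyped, so the gate fails. A hard gate like this is what stops a pipeline from shipping on fluent-looking output alone.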

Those numbers matter because they separate “I tried some prompts” from “I built an evaluated document system.” The system knew how many files it had, which documents were in play, how tables were performing, and where product-truth gaps still existed.

Hard Lesson

The downstream prototype exposed the difference between parsing truth and recommendation truth.

This was the project’s most important reality check. The parser could extract a value that was locally defensible, and the product could still be wrong in the way that matters to a buyer.

The clearest example was co-pay. In one case, the local prototype showed a single co-pay value with strong confidence. But the real product truth depended on the variant. The policy wording alone was not enough. Variant-specific numbers lived in adjacent official documents like Product Benefit Tables and Customer Information Sheets.

Recommendation risk

Parsed correctly, still misleading

A single extracted number can look clean in a UI while still hiding variant-level conditions that matter to the user. That is a product-truth failure, not just a parser bug.

Design response

Source bundles, not single PDFs

The repo had to evolve from “one product equals one wording PDF” to “one product equals a bundle of official documents and version checks.”

Source-bundle explainer

Policy wording

PBT / benefit table

CIS

Brochure

Version drift check
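
A source bundle and its status check could be sketched as below. The status strings `rejected`, `stale_version`, `missing_pbt`, `missing_cis`, and `acceptable_with_known_gap` come from the project's registry, but the rules that assign them, their precedence, the `complete` label, and every field name here are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bundle:
    product: str
    wording_version: Optional[str]          # policy wording
    pbt_version: Optional[str] = None       # Product Benefit Table
    cis_version: Optional[str] = None       # Customer Information Sheet
    brochure_version: Optional[str] = None

def bundle_status(bundle: Bundle, latest_wording_version: str) -> str:
    """Assumed precedence: identity first, then freshness, then completeness."""
    if bundle.wording_version is None:
        return "rejected"
    if bundle.wording_version != latest_wording_version:
        return "stale_version"  # version drift check
    if bundle.pbt_version is None:
        return "missing_pbt"
    if bundle.cis_version is None:
        return "missing_cis"
    if bundle.brochure_version is None:
        return "acceptable_with_known_gap"
    return "complete"  # hypothetical label, not taken from the project

# Hypothetical product and version strings.
draft = Bundle(product="Example Health Plan", wording_version="v3")
curated = Bundle(product="Example Health Plan", wording_version="v3",
                 pbt_version="v3", cis_version="v3")

print(bundle_status(draft, "v3"), bundle_status(curated, "v3"))
```

Under these assumed rules, a wording-only bundle is flagged as `missing_pbt` before any variant-level number, such as co-pay, is shown to a buyer.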

Closeout

The project stopped because the product problem turned out to be operational, not just technical.

The closeout was not a dramatic collapse. It was a sober decision. The engine was real. The extraction work was valuable. But building a broad, continuously current consumer comparison product from public insurer PDFs was not a sane side-project operating model.

Once you care about trustworthy recommendations, you are no longer just building a parser. You are maintaining product catalogs, cross-document precedence, live version drift, and source freshness. That is ongoing data operations, not just model quality.

The engine worked

The parser, table stack, and extractor pipeline became robust enough to generate structured outputs with evidence.

The market surface was unstable

Public insurer documents were fragmented, versioned unevenly, and often split across wording, tables, CIS documents, and brochures.

The honest ending was to stop

Forcing the broad product thesis would have created a trust problem disguised as product progress.

What Survived

What remains valuable is the architecture, the discipline, and the record.

Even though the broad consumer product stopped, the repo still holds several things worth reusing. It is a practical reference for building document pipelines where evidence matters and wrong answers are more dangerous than missing ones.

  • Document-intelligence architecture: separate physical, logical, and semantic layers instead of treating a PDF as one blob of text.
  • Evaluation discipline: use reviewed corpora and hard gates so the system cannot quietly flatter itself.
  • Evidence-linked extraction: every user-facing fact should stay tied to the source that justifies it.
  • Source-bundle modeling: product truth often lives across multiple official documents, not one canonical file.

That is the part I would carry forward into another domain. Not the fantasy of “AI can read everything,” but the more defensible claim that complex documents can be compiled into explicit, inspectable structures.

Appendix

Project record

Everything above is traceable to the project’s own repository record: the README, the evaluation docs, the task ledger, the bundle closeout, and the final project closeout. That paper trail is part of the work.