Verify Structured Output with Field-Level Citations

Sarah Guthals, PhD

Missing evidence is one of the biggest blockers in production AI workflows.

It’s not enough to say what a document claims; you need to show where in the source that claim came from. Whether you’re auditing bank statements, verifying medical referral forms, or investigating fraud, traceability is a hard requirement.

That’s why we’ve introduced a new parameter in Tensorlake’s StructuredExtractionOptions:


Code:
StructuredExtractionOptions(
    schema_name="ExampleSchema",
    json_schema=ExampleSchema,
    provide_citations=True
)

When provide_citations=True, every extracted field includes:

  • Page number
  • Bounding box (bbox) coordinates

This means structured outputs are no longer just machine-readable; they’re auditable, verifiable, and traceable back to the source document.


Traceable Context Means Trustworthy RAG


In many workflows, “close enough” isn’t good enough. Teams need confidence that extracted values align with the document’s ground truth. Let’s look at where this matters most:

  • Banking & Finance: Auditors need to understand exactly which account, statement, or transaction produced a reported number. If an account balance doesn’t reconcile, citations let you trace back to the precise page and bounding box where the discrepancy originates. No more guesswork in backtracking totals.
  • Fraud Detection: When anomalies appear in reported values, bounding-box citations provide the evidence trail. Investigators can quickly verify whether a suspicious number came from an altered document, a duplicated entry, or a genuine filing.
  • Healthcare & Forms Processing: At UCLA, teams processing medical referral forms wanted faster verification of ground truth. With citations, a structured field (like “referral date” or “doctor’s signature”) can point directly to the page span and bounding box where it was found, cutting human review time dramatically.

In short:

Citations turn structured extraction into a compliance-grade tool.

Implement Citations with One Line of Code


Let’s take a simple example: extracting transaction summaries from a bank statement.


Code:
from typing import List

from pydantic import BaseModel, Field
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

# Schema for a single row of the statement's transaction table
class Transaction(BaseModel):
    date: str = Field(description="Transaction date")
    description: str = Field(description="Transaction description")
    amount: float = Field(description="Transaction amount")

# Top-level schema: the statement is a list of transactions
class BankStatement(BaseModel):
    transactions: List[Transaction]

# Create the Document AI client
doc_ai = DocumentAI()

structured_extraction_options = [
    StructuredExtractionOptions(
        schema_name="BankStatement",
        json_schema=BankStatement,
        provide_citations=True   # <-- new parameter
    )
]

# Parse the document and block until the result is ready
result = doc_ai.parse_and_wait(
    file="https://tlake.link/documents/bank-statement",
    structured_extraction_options=structured_extraction_options
)

# Extracted fields for the first schema, with citations attached
print(result.structured_data[0].data)

The returned JSON now looks like this:


Code:
{
    "transactions": [
        {
            "Date": "08/24",
            "Date_citation": [
                {
                    "page_number": 1,
                    "x1": 59,
                    "x2": 135,
                    "y1": 448,
                    "y2": 482
                }
            ],
            "amount": "50.00",
            "amount_citation": [
                {
                    "page_number": 1,
                    "x1": 515,
                    "x2": 585,
                    "y1": 447,
                    "y2": 482
                }
            ],
            "descriptions": "ATM CASH DEPOSIT, ***** 30073995581 AUT 082220 ATM CASH DEPOSIT 550 LONG BEACH BLVD LONG BEACH * NY",
            "descriptions_citation": [
                {
                    "page_number": 1,
                    "x1": 135,
                    "x2": 515,
                    "y1": 447,
                    "y2": 482
                }
            ]
        }
    ]
}

Each field is now annotated with a citation: the page number and bounding-box coordinates.
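
To act on these annotations, a consumer needs to pair each field with its matching `<field>_citation` entry. The following is a minimal sketch, assuming records shaped like the sample output above; `iter_cited_fields` and `CITATION_SUFFIX` are hypothetical names used for illustration, not part of the Tensorlake SDK.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Suffix observed in the sample output; treated here as an assumption.
CITATION_SUFFIX = "_citation"


def iter_cited_fields(
    record: Dict[str, Any],
) -> Iterator[Tuple[str, Any, List[Dict[str, int]]]]:
    """Yield (field_name, value, citations) for each field in one record."""
    for key, value in record.items():
        if key.endswith(CITATION_SUFFIX):
            continue  # citation entries are attached to their field below
        # A field without a citation entry yields an empty list.
        yield key, value, record.get(key + CITATION_SUFFIX, [])


# One field copied from the sample output above.
sample = {
    "amount": "50.00",
    "amount_citation": [
        {"page_number": 1, "x1": 515, "x2": 585, "y1": 447, "y2": 482}
    ],
}

for name, value, citations in iter_cited_fields(sample):
    for c in citations:
        print(
            f"{name}={value!r} -> page {c['page_number']}, "
            f"bbox=({c['x1']}, {c['y1']}, {c['x2']}, {c['y2']})"
        )
```

Keeping the pairing in one small helper means the rest of a pipeline can treat "value plus evidence" as a single unit, whatever field names the schema defines.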

If you use the Tensorlake Cloud Playground, you can also see the bounding boxes drawn and labeled for each extracted value.

Example of Tensorlake structured data extraction with citations, showing JSON output linked to highlighted fields on a TD Bank statement. A $50.00 transaction is mapped from the document to the JSON citation with bounding box coordinates.
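
A similar overlay can be drawn in your own viewer by turning a citation into a rectangle. This sketch assumes that (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner of the box, in the same units as the rendered page; the post does not state the coordinate convention, so check it against your own output. `HighlightRect`, `citation_to_rect`, and the `scale` factor are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HighlightRect:
    """Rectangle in viewer coordinates (hypothetical helper type)."""
    page_number: int
    left: float
    top: float
    width: float
    height: float


def citation_to_rect(citation: Dict[str, int], scale: float = 1.0) -> HighlightRect:
    """Convert corner coordinates into a left/top/width/height rectangle.

    Assumes x1 < x2 and y1 < y2; `scale` maps citation units to viewer
    pixels when the page is rendered at a different size.
    """
    return HighlightRect(
        page_number=citation["page_number"],
        left=citation["x1"] * scale,
        top=citation["y1"] * scale,
        width=(citation["x2"] - citation["x1"]) * scale,
        height=(citation["y2"] - citation["y1"]) * scale,
    )


# Citation for the $50.00 amount, copied from the sample output.
amount_citation = {"page_number": 1, "x1": 515, "x2": 585, "y1": 447, "y2": 482}
rect = citation_to_rect(amount_citation)
print(rect)
```

The left/top/width/height form maps directly onto most drawing APIs, such as an absolutely positioned element over a page image.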

From Data to Evidence

“In insurance, structured outputs power our workflows, but people still verify. With field-level citations, reviewers can jump from a data row straight to the exact COI or endorsement language. That’s the difference between ‘parsed’ and provable.”

Jesse McClure, CTO and Co-Founder, Sublynk

Citations aren’t just a nice-to-have. Our customers across industries know that they unlock new workflows:

  • Audit-ready outputs: Every number is backed by ground-truth evidence.
  • Automated review: Flag discrepancies automatically and point reviewers directly to the source.
  • Explainability in RAG/Agents: Don’t just return answers; return the highlighted document snippets.
  • UI Enhancements: Build document viewers that highlight the exact fields extracted.

The benefit is twofold: engineers can build more reliable systems, and stakeholders (auditors, compliance teams, regulators) get confidence and transparency.
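
The automated-review workflow can be sketched as a reconciliation check that hands reviewers the source locations whenever the numbers disagree. This is an illustration only: `review_items` is a hypothetical helper, the second transaction and the reported totals are made-up sample values, and records are assumed to carry an `amount` field with an `amount_citation` list as in the sample output.

```python
from decimal import Decimal
from typing import Any, Dict, List


def review_items(
    transactions: List[Dict[str, Any]], reported_total: Decimal
) -> List[Dict[str, Any]]:
    """Return each amount with its citations when the total does not reconcile."""
    # Decimal avoids float rounding noise when summing currency values.
    computed = sum(
        (Decimal(str(t["amount"])) for t in transactions), Decimal("0")
    )
    if computed == reported_total:
        return []  # totals reconcile, nothing to review
    return [
        {"amount": t["amount"], "where": t.get("amount_citation", [])}
        for t in transactions
    ]


transactions = [
    {   # copied from the sample output
        "amount": "50.00",
        "amount_citation": [
            {"page_number": 1, "x1": 515, "x2": 585, "y1": 447, "y2": 482}
        ],
    },
    {   # hypothetical second row, for illustration only
        "amount": "25.00",
        "amount_citation": [
            {"page_number": 1, "x1": 515, "x2": 585, "y1": 490, "y2": 525}
        ],
    },
]

# Hypothetical reported total that does not match the extracted rows.
for item in review_items(transactions, Decimal("80.00")):
    for c in item["where"]:
        print(f"check amount {item['amount']} on page {c['page_number']}")
```

Returning the citations alongside each value lets a review queue link straight to the page region instead of asking a person to search the document.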

Try Structured Extraction Citations Now


You can try provide_citations=True today in both the Tensorlake Playground and the API/Python SDK.


If you have any questions or feedback, we'd love to hear from you! Join our Slack and let us know how you're using citations.

Traceability Built In


With the new provide_citations parameter, structured extraction becomes not only machine-readable but also evidence-backed.

Every field can now point back to its exact source location in the document, making Tensorlake the foundation for audit-ready, compliance-grade, and fraud-resistant AI workflows.

Start using it today. In production AI, traceability isn’t optional.
