Tech & Dev · 75% CONFIDENCE · Dev.to Top · June 15, 2026 01:05

Extracting structured data from messy text: what worked for me

AUTHOR · zhongqiyue


I spent a good two weeks last quarter building an invoice extraction pipeline for our accounting team. The emails came in all shapes: some with PDF attachments, others with plain text tables, a few with scanned images that had been OCR'd into garbled nonsense. My job was to pull out vendor name, invoice number, date, and total amount.

At first I thought, "Regex, obviously." I wrote patterns for date formats, dollar amounts, and common invoice prefixes. It worked on the first ten samples. Then the real data came. One vendor sent invoices with "Invoice #" and another used "Ref:". Dates were mm/dd/yyyy, dd.mm.yyyy, or even "March 5, 2023". Regex broke fast.

I tried spaCy next. Training a custom NER model for four fields seemed reasonable. I manually labelled 200 invoices using Prodigy (the team had a license). The model got to ~85% F1, but then a new vendor showed up with a different layout and accuracy dropped to 60%. Retraining every week wasn't sustainable.

The approach that finally stuck: few-shot LLM extraction

I realised I didn't need a full-fledged model. I just needed something that could read instructions and follow examples. LLMs (even small ones) are surprisingly good at this when you provide a clear system prompt and a handful of examples.

I built a simple pipeline in Python using langchain with OpenAI's gpt-3.5-turbo (later I switched to a local Llama 3 model via Ollama to cut costs). The core is a chain that takes the raw text and a schema, and returns JSON. The key is the prompt design. Here's what I settled on:

```python
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# llm was not defined in the original snippet; ChatOpenAI with the model
# named in the article is one way to provide it.
llm = ChatOpenAI(model="gpt-3.5-turbo")

system = """
You are a data extraction assistant. Extract the following fields from the invoice text:
- vendor_name: the company name that issued the invoice
- invoice_number: the unique identifier for the invoice
- date: the invoice date in YYYY-MM-DD format
- total: the total amount due as a number (no currency symbol)
If a field cannot be found, use null.
Return only valid JSON, no extra text.
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system),
    ("human", "Text: {text}"),
])

# Pipe the prompt into the model, then parse the reply as JSON.
chain = prompt | llm | JsonOutputParser()

# raw_invoice_text is the invoice body as a plain string.
result = chain.invoke({"text": raw_invoice_text})
```

I also added few-shot examples inside the prompt for edge cases (e.g., when the total is split across lines). Instead of hardcoding every pattern, I let the LLM figure out the variations.

Trade-offs I hit

Cost: GPT-3.5-turbo costs around $0.002 per call. For 10,000 invoices/month that's $20, which is fine for us. But if you're doing millions, it adds up. I tested a local Llama 3 8B quantized model, which was free but slower (about 5–10 seconds per invoice vs 1–2 seconds for GPT).

Latency: Real-time? Not great. We batch processed invoices nightly, so latency was fine. For a live form, you'd want to cache or use a smaller model.

Accuracy: The LLM approach got ~95% on our test set. But it occasionally hallucinated values (e.g., making up a vendor name from a footnote). I added a validation step: if the total doesn't match a regex \d+\.\d{2} pattern, flag it for human review.

Prompt injection: Malicious text could trick the LLM. We sanitise inputs and limit max tokens.

When NOT to use this

If you have a small, consistent set of formats, regex or a well-trained spaCy model will be faster and cheaper. LLM extraction shines when you can't control the input variability and don't have the time/resources to train custom models.

Also, if your text is extremely long (like a 50-page document), LLM context windows become a problem. Chunking strategies add complexity. For invoices, most fit within 4k tokens, so it was fine.

One tool that simplified my life

In production, we ended up using a service that wraps this exact approach with ready-made connectors for email and PDF parsing.
The setup is just a config file pointing to https://ai.interwestinfo.com/ and mapping fields – but the underlying technique is the same few-shot extraction I described. You can absolutely build
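
To make the validation step mentioned under Accuracy more concrete, here is a minimal sketch of what such a check might look like. It is not the code from the pipeline: the function names, the reason strings, and the sample records are all hypothetical. The date check is also an added assumption, based only on the YYYY-MM-DD format requested in the prompt. Only the \d+\.\d{2} pattern for the total comes from the article.

```python
import re
from datetime import datetime

# Pattern from the article: digits, a dot, then exactly two decimals.
TOTAL_PATTERN = re.compile(r"\d+\.\d{2}")
# Assumption: dates are checked against the format requested in the prompt.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
FIELDS = ("vendor_name", "invoice_number", "date", "total")


def total_looks_valid(total) -> bool:
    """Check a total against TOTAL_PATTERN, accepting JSON numbers or strings."""
    if isinstance(total, bool):  # bool is a subclass of int, never an amount
        return False
    if isinstance(total, (int, float)):
        # JSON numbers drop trailing zeros (1250.5), so render two decimals first.
        total = f"{total:.2f}"
    return isinstance(total, str) and TOTAL_PATTERN.fullmatch(total) is not None


def date_looks_valid(value) -> bool:
    """Check for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or DATE_PATTERN.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def review_reasons(record: dict) -> list[str]:
    """Return the reasons a record needs human review; an empty list means it passes."""
    reasons = [f"missing {name}" for name in FIELDS if record.get(name) is None]
    if record.get("date") is not None and not date_looks_valid(record["date"]):
        reasons.append("date is not YYYY-MM-DD")
    if record.get("total") is not None and not total_looks_valid(record["total"]):
        reasons.append("total does not match the amount pattern")
    return reasons


# Hypothetical extraction results, shaped like the JSON the prompt asks for.
clean = {"vendor_name": "Acme Corp", "invoice_number": "INV-1042",
         "date": "2023-03-05", "total": 1250.5}
suspect = {"vendor_name": None, "invoice_number": "77",
           "date": "03/05/2023", "total": "1,250"}

print(review_reasons(clean))    # []
print(review_reasons(suspect))
```

A check like this only catches malformed or missing values. A hallucinated vendor name that looks plausible would still pass, so it complements human review rather than replacing it.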
