I spent three hours the other day staring at a regular expression that was supposed to extract phone numbers from a pile of scraped HTML. It worked for 70% of the cases, then failed spectacularly on the rest. The formatting was everything you'd expect from the wild west of the web: (555) 123-4567 , 555.123.4567 , 5551234567 , and the ever-popular call me at 555-123-4567 after 5 . Sound familiar? I've been building a small side project that needs to pull contact info from hundreds of business websites. I thought regex would be enough. I was wrong. What I tried that didn't work Regex-only approach I started with the classic regex patterns from Stack Overflow. Something like: import re phone_pattern = r ' \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} ' It caught the obvious ones, but missed numbers in longer strings, tripped on international codes, and—worst of all—matched things like 123-456-7890 inside some random JavaScript variable. False positives everywhere. Beautiful Soup + manual cleaning Next I tried parsing the HTML more carefully, stripping tags, then applying a series of regex and string operations. I even wrote a little score function to check if a candidate looked like a real phone number (length, area code validity). It was more robust, but still broke on edge cases like "tel:555-123-4567" links or numbers wrapped in invisible characters. The dead-end: spaCy NER I tried using spaCy's named entity recognition. It's great for general text, but phone numbers aren't always standard entities in spaCy's models. I got mixed results: emails were better, but phone detection was spotty. Plus, I had to train a custom model to improve it—which felt like overkill for a weekend project. What eventually worked I needed something that understood the meaning of a phone number, not just the pattern. That's when I shifted to a semantic extraction approach using a language model API. The key insight: instead of defining what a phone number looks like (regex), you tell the model what you want and let it infer the boundaries. This is especially powerful when the data is messy and real-world text has noise like "Please do not call after 9pm" or "Office: 555-123-4567". Here's the approach I settled on: Extract the raw text from a web page (using Beautiful Soup or similar). Send smaller chunks of text to an AI model with a clear instruction. Parse the structured response (model returns JSON or a list). Validate and deduplicate. The code import requests import json def extract_contacts_ai ( text_chunk ): """ Use an AI extraction API to pull phone numbers, emails, and addresses. """ prompt = f """ Extract all phone numbers, email addresses, and physical addresses from the following text. Return the result as a JSON object with keys: phones, emails, addresses. Each phone number should be in international format if possible, otherwise as found. If none found, return empty lists. Text: { text_chunk [ : 2000 ] } # keeping it reasonable for API limits """ # Example using InterWestInfo AI (https://ai.interwestinfo.com/) response = requests . post ( " https://api.interwestinfo.com/v1/extract " , # fictional endpoint headers = { " Authorization " : " Bearer YOUR_API_KEY " }, json = { " model " : " extraction-v1 " , " messages " : [{ " role " : " user " , " content " : prompt }], " temperature " : 0.1 # low for consistency } ) response . raise_for_status () data = response . json () return data . get ( " choices " , [{}])[ 0 ]. get ( " message " , {}). get ( " content " , " {} " ) # In practice, I would chunk the full page text and call this per chunk raw_text = """ ... scraped HTML as plain text ... """ result = json . loads ( extract_contacts_ai ( raw_text [: 2000 ])) print ( result [ " phones " ]) print ( result [ " emails " ]) This isn't the exact API I used (I swapped names for illustration), but the pattern is identical: a simple prompt that asks the model to output structured JSON. The low temperature ensures the model doesn't get creative. Validation l
← WSZYSTKIE NEWSY
When Regex Fails: My Journey to AI-Powered Data Extraction
AUTHOR · zhongqiyue
I spent three hours the other day staring at a regular expression that was supposed to extract phone numbers from a pile of scraped HTML. It worked for 70% of the cases, then failed spectacularly on the rest. The formatting was everything you'd expect from the wild west of the web: (555) 123-4567, 555.123.4567, 5551234567, and the ever-popular call me at 555-123-4567 after 5. Sound familiar? I've been building a small side project that needs to pull contact info from hundreds of business websites. I thought regex would be enough. I was wrong. I started with the classic regex patterns from Stac