Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding—letting an LLM figure out the extraction. Here’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup. The problem that broke my scraper I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built regex broke. I spent more time maintaining scrapers than actually using the data. A typical selector for a price field looked like this: import re import requests from bs4 import BeautifulSoup response = requests . get ( ' https://example.com/product/123 ' ) soup = BeautifulSoup ( response . text , ' html.parser ' ) # This selector changed three times in two weeks price_element = soup . select_one ( ' span.price--current > span.value ' ) if not price_element : price_element = soup . find ( ' div ' , class_ = re . compile ( r ' price.* ' )) I was debugging selectors more than I was analyzing prices. Something had to change. What I tried that didn’t work First I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning—training a small model on HTML structure. Overkill for a side project, and I didn’t have labeled data for each site. I looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool. Then I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical—LLMs are slow, expensive, and hallucinate. But the pain was real, so I gave it a shot. The approach that eventually worked Instead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: “Extract the product name, price, and availability status. Return JSON.” Here’s the core function I ended up with: import json from openai import OpenAI import requests client = OpenAI () def extract_product_data ( html_snippet : str ) -> dict : prompt = f """ You are a data extraction assistant. From the following HTML, extract: - product_name (string) - price (string, include currency symbol if present) - in_stock (boolean) Return only valid JSON with no extra text. HTML: { html_snippet [ : 4000 ] } """ # Truncated to reduce tokens response = client . chat . completions . create ( model = " gpt-4o-mini " , # Cheaper and fast enough messages = [{ " role " : " user " , " content " : prompt }], temperature = 0 , response_format = { " type " : " json_object " } ) return json . loads ( response . choices [ 0 ]. message . content ) To use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep token count low). import re def clean_html ( raw_html : str ) -> str : # Remove script and style tags cleaned = re . sub ( r ' <script[^>]*>.*?</script> ' , '' , raw_html , flags = re . DOTALL ) cleaned = re . sub ( r ' <style[^>]*>.*?</style> ' , '' , cleaned , flags = re . DOTALL ) return cleaned [: 5000 ] # Keep first 5000 chars as context Then I called: raw = requests . get ( ' https://example.com/product/123 ' ). text snippet = clean_html ( raw ) data = extract_product_data ( snippet ) print ( data ) # {'product_name': 'Trail Pro Jacket', 'price': '$89.99', 'in_stock': True} It worked surprisingly well—on maybe 80% of the pages. The LLM could find the price even when it was buried in a table or formatted with weird
← WSZYSTKIE NEWSY
Why I ditched regex scrapers for an LLM parser (and when you shouldn't)
AUTHOR · zhongqiyue
Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding—letting an LLM figure out the extraction. Here’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup. I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a