Tavis Lochhead, Co-Founder of Kadoa

The earnings call isn't the first place a company's momentum appears. The hiring page is. Or the pricing grid. Or a quiet update to a product catalog that goes live three weeks before any press release.
Research teams have always known that. What's changed is that you can now monitor those signals across entire sectors without a ten-person engineering squad behind every scraper. AI data extraction is the reason.
AI data extraction uses machine learning and large language models to turn unstructured web content into structured datasets research teams can act on. For hedge funds, asset managers, and quant teams, that means public web data becomes structured signals (hiring, pricing, corporate communications, sentiment) weeks before the same information reaches vendor feeds or earnings guidance.
Research stacks have been built for decades on the same four inputs: earnings reports, regulatory filings, sell-side analyst coverage, and market data feeds. Each is useful. No single one tells the whole story.
A lot of what moves a company's trajectory appears on the public web first. A jobs page signals a new product line months before the strategy update. A pricing change flags demand pressure before the quarter closes. A regional landing page quietly confirms a market expansion that hasn't been announced yet. These signals exist, they're public, and they appear ahead of the disclosure calendar.
The problem has never been absence of signal. It's been extraction: thousands of sources, inconsistent layouts, constantly changing structure, no common schema. Manually, this kind of coverage doesn't scale. With AI data extraction, it can.
AI data extraction is the use of machine learning and large language models to identify relevant information on a web page, PDF, or other document, interpret the unstructured content, and convert it into a structured dataset a research team can query. For investment research, typical sources include job postings and careers pages, product and pricing catalogs, press releases and investor relations updates, and review platforms and public discussions.
What makes this different from traditional rule-based scraping is that AI extraction can interpret context and adapt when layouts change. A rule-based scraper asks "is there a div with this exact CSS class at this path?" and fails the moment the site's designer renames the class. An AI-native extractor asks "what on this page looks like a job posting?" and keeps working across the next redesign. That difference makes continuous monitoring of thousands of sources a realistic engineering commitment instead of a permanent firefight.
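To make the contrast concrete, here is a minimal sketch. The rule-based version is pinned to one exact DOM layout; the AI-native version describes what the fields mean. The `ai_extract` callable is a hypothetical stand-in for whatever extraction platform or LLM call sits underneath, not a specific vendor API.

```python
from dataclasses import dataclass
from bs4 import BeautifulSoup

# Rule-based approach: bound to one exact layout.
# It breaks the moment the designer renames a class or changes the nesting.
def scrape_jobs_rule_based(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True)
            for h3 in soup.select("div.careers-list div.job-card h3.title")]

# AI-native approach: describe what the fields *mean* instead of where they sit in the DOM.
@dataclass
class JobPosting:
    title: str
    department: str
    location: str

def scrape_jobs_ai(html: str, ai_extract) -> list[JobPosting]:
    # `ai_extract` is whatever extraction platform or LLM call the stack uses;
    # it is passed in because this sketch doesn't assume any specific vendor API.
    return ai_extract(
        html,
        schema=JobPosting,
        instruction="Return every open job posting on this page.",
    )
```

The semantic version survives a redesign because nothing in it references the page's structure, only the meaning of the data being asked for.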
There's a broader way to see why this matters now. The most capable LLMs today reason well about what they already know, but they can't tell you today's price, this week's hiring update, or yesterday's filing unless they are given that context. Extraction supplies the knowledge layer that reasoning alone doesn't produce. For research teams, that's what lets the same LLM that competently summarizes a 10-K become useful for monitoring a live thesis.
Three structural problems inside research operations are pushing this adoption, and none of them are new: coverage that doesn't scale when collection is manual, output that arrives without a common schema, and sources whose structure changes without warning. What's new is that the tooling can now handle them at scale. AI usage is now broadly entrenched across the buy side: every surveyed fund manager uses AI systems in research, portfolio optimization, or trading to a moderate or large extent. With usage saturated, the differentiation is in what feeds those systems.
AI data extraction handles each of these by automating collection, standardizing the output, and monitoring for change continuously. The result is an alternative data pipeline that complements traditional feeds rather than competing with them. Around 77% of investment advisers now use both vendor-provided alt data and in-house sources, so direct extraction supplements rather than replaces external solutions.
A subtler driver is worth naming: Neudata's 2026 alternative-data market analysis indicates that AI isn't yet changing how much funds spend on alternative data so much as how they want to consume it. Buyer preference has shifted toward raw, structured, machine-readable datasets that feed proprietary AI and LLM workflows, rather than vendor-packaged AI features. That preference maps exactly to what AI-native extraction produces.
Most of the signal that research teams act on falls into four categories. Each has a different update cadence and lead time to disclosure.
Job boards function almost like a real-time strategy disclosure. Hiring velocity, department mix, geographic footprint, and skill composition all change before earnings guidance reflects them.
A spike in AI-agent engineering hires at a CRM or ERP vendor is usually a strategic product bet that hasn't been announced yet. A freeze across engineering and sales points to cost pressure before the quarterly reset. Geographic role concentration in a new city often flags a market entry weeks before the press release.
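As a rough illustration of how those signals get computed once postings are extracted, here is a sketch assuming a table of postings with company, department, location, and first-seen columns; the file path and column names are placeholders, not a fixed schema.

```python
import pandas as pd

# Assumed input: extracted job postings with columns
# company, department, location, first_seen (date the posting appeared).
postings = pd.read_parquet("job_postings.parquet")            # placeholder path
postings["week"] = pd.to_datetime(postings["first_seen"]).dt.to_period("W")

# Hiring velocity: new postings per company per week.
velocity = postings.groupby(["company", "week"]).size().rename("new_postings")

# Department mix: engineering share of new postings vs. a trailing 12-week baseline.
eng_share = (postings.assign(is_eng=postings["department"].eq("Engineering"))
                     .groupby(["company", "week"])["is_eng"].mean())
baseline = eng_share.groupby(level="company").transform(
    lambda s: s.rolling(12, min_periods=4).mean())
mix_shift = eng_share - baseline
# Sustained positives flag an unannounced build-out; sustained negatives flag a freeze.
```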
Product catalogs run on a daily clock, not a quarterly one. The useful signals are price volatility by SKU, discount frequency and depth, inventory and stock-availability patterns, and new product listings.
Retail price cuts across a category usually point to demand weakness or excess inventory. A sustained price increase often signals supply constraints weeks before industry reports confirm it. New product pages going live, even with no announcement, tend to precede formal launches.
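A minimal sketch of turning daily catalog snapshots into those signals, assuming columns for SKU, date, list price, sale price, and stock status; the names and windows are illustrative.

```python
import pandas as pd

# Assumed input: daily catalog snapshots with columns
# sku, date, list_price, sale_price, in_stock.
prices = pd.read_parquet("catalog_snapshots.parquet").sort_values(["sku", "date"])  # placeholder path

by_sku = prices.groupby("sku")
prices["price_volatility"] = by_sku["sale_price"].transform(
    lambda s: s.pct_change().rolling(28).std())            # SKU-level price volatility
prices["discount_depth"] = 1 - prices["sale_price"] / prices["list_price"]
prices["oos_rate"] = by_sku["in_stock"].transform(
    lambda s: 1 - s.astype(float).rolling(28).mean())      # trailing stock-out frequency

# Category-wide deepening discounts or rising stock-outs are the demand and supply
# flags described above; aggregate to issuer level before feeding a model.
```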
Press releases, investor relations updates, and product announcements carry strategic context that rarely fits into structured feeds. The useful fields are announcement type, counterparty, geography, and any numeric commitment (spend, headcount, capacity).
Partnership announcements, new market entries, and capacity investments often appear here first. Reading them at scale, across a sector, produces a continuously updated view of competitive positioning that no single analyst team can match manually.
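One way to picture the target output is a structured record per announcement, matching the fields described above. The field names below are illustrative, not a fixed Kadoa schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# One extracted corporate-communications record. Field names are illustrative.
@dataclass
class Announcement:
    company: str
    published: date
    announcement_type: str               # e.g. "partnership", "market entry", "capacity investment"
    counterparty: Optional[str]          # the other party, if any
    geography: Optional[str]             # region or country the announcement concerns
    numeric_commitment: Optional[float]  # spend, headcount, or capacity figure
    commitment_unit: Optional[str]       # "USD", "employees", "MW", ...
    source_url: str                      # grounding back to the original page
```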
Review platforms and public discussions carry early signals on product quality, brand momentum, and customer experience. The useful signals are rating trends, complaint theme clusters, review volume spikes, and sentiment shifts by product line.
A sustained rating decline across multiple review sites often precedes revenue softness in subsequent quarters. A sharp rise in complaints about a specific feature can expose a quality issue before it reaches returns or warranty data.
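A sketch of one such check, assuming reviews have been extracted with company, site, date, and rating fields; the thresholds and column names are illustrative.

```python
import pandas as pd

# Assumed input: extracted reviews with columns company, site, review_date, rating (1-5).
reviews = pd.read_parquet("reviews.parquet")                  # placeholder path
reviews["month"] = pd.to_datetime(reviews["review_date"]).dt.to_period("M")

monthly = (reviews.groupby(["company", "site", "month"])["rating"]
                  .mean().reset_index().sort_values("month"))

# Trailing 6-month average rating per company and review site.
monthly["trailing"] = (monthly.groupby(["company", "site"])["rating"]
                              .transform(lambda s: s.rolling(6, min_periods=3).mean()))

# Flag a sustained, multi-site decline: rating meaningfully below trend on 2+ sites.
monthly["declining"] = monthly["rating"] < monthly["trailing"] - 0.2
multi_site_decline = monthly.groupby(["company", "month"])["declining"].sum() >= 2
```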
Kadoa's investment-research signals guide covers adjacent categories like store footprint and product-update tracking, with worked examples for each.
A useful research pipeline isn't one big model. It's a sequence of steps, each with a specific output, that together convert raw pages into output a quant model or an analyst dashboard can consume.
The typical flow runs through six stages: source identification, AI-native extraction, normalization, validation, continuous monitoring, and delivery into the systems downstream teams already use.
The AI model work is roughly 30% of what gets a pipeline to institutional-grade reliability. The other 70% is orchestration, data validation, error handling and human-in-the-loop tooling for the cases automation can't resolve. That ratio is why evaluating an extraction platform should focus more on the production layers than on which model sits underneath.
Kadoa's platform runs this as an Agentic ETL architecture: AI agents generate deterministic extraction code per source, not black-box LLM calls on every page. The generated code is versioned, so a given run is reproducible against a specific extractor revision. Workflows are auditable end-to-end, with source grounding that links each data point back to its origin.
When a site's structure changes, the self-healing layer detects the break, regenerates the selectors, validates the output against historical patterns, and ships a new extractor version before resuming. If automated recovery fails, the platform notifies the workflow owner rather than letting the pipeline silently drift.
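The control flow is roughly the sketch below. Every callable and attribute name here is a placeholder for platform internals, not Kadoa's actual API; the point is the order of operations, not the interface.

```python
def run_with_self_healing(source, extractor, history, regenerate, validate, notify):
    """Control-flow sketch of the recovery loop described above.

    `extractor` wraps versioned, deterministic extraction code; `regenerate`,
    `validate`, and `notify` stand in for platform internals and are placeholders.
    """
    records = extractor.run(source)
    if validate(records, history):                 # schema, volume, and value-range checks
        return records, extractor

    # Structure changed: regenerate the extractor for the same schema, then re-validate.
    candidate = regenerate(source, schema=extractor.schema)
    records = candidate.run(source)
    if validate(records, history):
        return records, candidate                  # ships as a new extractor version

    # Automated recovery failed: page the workflow owner instead of drifting silently.
    notify(source, reason="extraction break could not be auto-recovered")
    return None, extractor
```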
Beyond resilience, four pipeline outputs matter specifically for institutional use: point-in-time fidelity, per-field source grounding, reproducibility against a versioned extractor, and an end-to-end audit trail.
These four properties determine whether a dataset can sit inside a quant model or a regulated research workflow at all.
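In practice, those properties amount to metadata that travels with every extracted data point. A sketch of what that envelope might look like, with illustrative field names:

```python
from dataclasses import dataclass
from datetime import datetime

# Metadata carried alongside every extracted value. Field names are illustrative,
# not a fixed schema; what matters is that each property below is recorded at all.
@dataclass
class GroundedRecord:
    fields: dict                # the extracted values themselves
    source_url: str             # grounding back to the origin page
    retrieved_at: datetime      # point-in-time stamp: what was knowable, and when
    extractor_version: str      # which versioned extractor produced it (reproducibility)
    validation_status: str      # "passed", "flagged", or "needs_review" (audit trail)
```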
The institutional value of self-healing isn't recovery itself. It's that failures get paged rather than silently drifting off-spec, which is the difference between a pipeline you can trust at scale and one that quietly poisons downstream models.
Delivery runs through Snowflake, S3, REST APIs, webhooks, WebSocket streams, spreadsheets, and MCP (Model Context Protocol) connectors for AI agent workflows, which means the extracted data arrives where research teams already work.
The benefits cluster in four places: lead time on signals that haven't reached the disclosure calendar yet, coverage breadth that manual collection can't match, engineering cost that no longer scales with the number of sources, and reliability of the resulting datasets.
Reliability matters as much as speed here. Data quality is the most-cited post-onboarding complaint among institutional data buyers (around 42% flag it as the top issue with existing datasets). Deterministic code generation, per-field source grounding, and automated validation are what keep trust in the pipeline when the source count grows from 50 to 2,000.
Document-heavy extraction has its own reliability bar. For filings, transcripts, and prospectuses, reliability depends on cross-document reconciliation: totals tying across statements, period-over-period values rolling forward without breaks, and entity names resolving to the same issuer regardless of which filing they appear in.
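Two of those reconciliation checks are straightforward to sketch, assuming line items have been extracted with issuer, statement, line item, period, and value fields; the column names, frequency, and tolerance are placeholders.

```python
import numpy as np
import pandas as pd

# Assumed input: extracted statement line items with columns
# issuer, filing_id, statement, line_item, period, value.
def totals_tie(filing: pd.DataFrame, tol: float = 0.01) -> bool:
    """The reported total must equal the sum of its components within tolerance."""
    total = filing.loc[filing["line_item"] == "total", "value"].iloc[0]
    components = filing.loc[filing["line_item"] != "total", "value"].sum()
    return abs(total - components) <= tol * max(abs(total), 1.0)

def rolls_forward(periods: pd.Series) -> bool:
    """Period-over-period values must roll forward without gaps or duplicates."""
    ordinals = pd.PeriodIndex(periods.sort_values(), freq="Q").asi8
    return bool(np.all(np.diff(ordinals) == 1))
```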
AI data extraction has made a real category of problems tractable, but it hasn't made them disappear. Four concerns remain: compliance review of every new source, data quality once the source count grows, silent drift when site structures change, and the organizational gap between collecting AI-processed data and actually moving it into live strategies.
Research stacks are converging on three data layers: traditional market data, vendor alternative data, and internal web-derived signals. The third layer is where most of the differentiation will come from over the next two to three years, and it's increasingly visible across the broader 2026 alt-data trend landscape.
A few specific shifts worth tracking: buyer preference moving toward raw, machine-readable datasets that feed proprietary models rather than vendor-packaged AI features; funds pairing vendor feeds with in-house extraction instead of choosing between them; and web-derived signals moving from one-off projects into continuously monitored pipelines.
A secondary shift is emerging: the systems consuming extracted data are increasingly AI agents, not just human analysts. Research copilots, automated monitoring tools, and agent-driven workflows all need the same input: structured, current web signals they can reason over. Every one of those agents multiplies demand for the upstream extraction pipeline that supplies them.
Roughly 93% of firms plan to increase AI budgets in 2026, and 89% plan to grow alt data budgets. But only 31% have moved AI-processed data directly into investment strategies, a figure that has roughly doubled from 14% a year earlier. That rate of change, not the current numbers, is what research teams will be working through over the next 18 to 24 months.
The hiring page, the pricing grid, and the quiet catalog update aren't hard to monitor at scale anymore. At 90% alt data adoption, the edge no longer comes from accessing those signals. It comes from extracting, structuring, and shipping them faster than the next fund.
The practical starting point is one signal category and a narrow thesis: 10 to 20 target companies, one source type, a pilot measured against whether the data adds signal to existing research. The questions worth pressing in any vendor evaluation are point-in-time fidelity, source grounding per field, and how the platform handles silent drift. From there, the pipeline expands by source, by category, and by integration point.
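A minimal sketch of that pilot measurement, assuming two small hand-labelled tables: the dates extracted signals appeared, and the disclosure dates they anticipated. File and column names are placeholders.

```python
import pandas as pd

# Assumed inputs (hand-labelled for the pilot):
#   pilot_signals.csv:     company, event, signal_date      (when the web signal was extracted)
#   pilot_disclosures.csv: company, event, disclosed_date   (when the same fact became public)
signals = pd.read_csv("pilot_signals.csv", parse_dates=["signal_date"])
disclosures = pd.read_csv("pilot_disclosures.csv", parse_dates=["disclosed_date"])

matched = signals.merge(disclosures, on=["company", "event"], how="inner")
matched["lead_days"] = (matched["disclosed_date"] - matched["signal_date"]).dt.days

hit_rate = len(matched) / len(disclosures)       # share of disclosed events the pipeline caught early
median_lead = matched["lead_days"].median()      # is the lead time long enough to act on?
```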
For research teams setting this up in production, the Kadoa investment research platform covers source identification, AI-native extraction, normalization, monitoring, and delivery into the tools your analysts and models already use. Any platform that keeps every extraction auditable, traceable to its source, and compliant with institutional policy fits the same pattern. That combination, more than the specific vendor, is what makes the pipeline hold up at institutional scale.
AI data extraction is the use of machine learning and large language models to identify relevant information in unstructured content (web pages, documents, filings) and convert it into structured datasets. Unlike rule-based scrapers, AI extractors interpret context and can adapt when source layouts change.
AI data extraction interprets fields semantically (a job posting, a price, a date) and adapts when site layouts change. Traditional scrapers rely on fixed HTML rules and break whenever the structure shifts. For research use, this lowers maintenance cost and makes multi-thousand-source monitoring practical.
The most common categories are hiring and workforce data from job postings, product and pricing data from catalogs, corporate communications from press releases and investor relations pages, and consumer sentiment from reviews and public discussions. Each has a distinct update cadence and lead time relative to disclosure.
Web data extraction can be compliant for institutional use when the platform is designed for it: robots.txt enforcement, PII filtering, MNPI screening, GDPR and CCPA handling of personal data, audit trails, and source grounding for every data point. Platforms with built-in policy engines and approval workflows reduce legal review cycles from weeks to days.
Most firms use both vendor feeds and direct extraction. Around 77% of investment advisers use a mix of in-house and vendor data. Vendor feeds give broad, reliable coverage of commoditized signals. Direct AI extraction fills gaps where the fund needs a proprietary schema, a custom source list, or a refresh cadence the vendor doesn't support.
Pick one signal category tied to an active thesis, identify 10 to 20 target companies, and run a pilot for a few weeks. Measure against a baseline: does the extracted data flag thesis-relevant events that existing research missed, and is the lead time meaningful enough to act on? If the pilot shows signal, expand by source and by category from there rather than trying to cover everything at once.

Tavis is a Co-Founder of Kadoa with expertise in product development and web technologies. He focuses on making complex data workflows simple and efficient.