How we crawled company data from the web
Abstract: With only the company name available, how would you learn more about a given business, or even gather leads to get in touch with it? A client approached us to build a comprehensive database that not only covers individual characteristics of selected companies, but also extends their data with information from unstructured sources on the web. Using cloud technologies and web-scraping frameworks, we built a data mart that provides an extensive overview of these companies.
Our customer set out to build a large collection of companies with information such as addresses, annual revenues, and leads. Their goal was to extend the data using previously unknown, unstructured sources on the web, so that their M&A department could maintain a list of pre-filtered, pre-validated candidate companies to feed into a manual validation process.
They soon realized there is no single free or paid source on the web that can deliver enough information to support this process. Each source covers a specific domain, with many failing to provide adequate information on specific companies.
At SPRYFOX, we saw this challenge as a federated search opportunity. Since on-demand retrieval wasn’t an option for the data we were looking for, we broke down our approach into several steps.
First, we agreed on a common bootstrapping technique by defining the input data available to the federated search algorithm. From there, we could evaluate valuable company data sources. The web, after all, is full of providers offering company data at various rates and plans. Our research led us to a combination of category-focused specialists and generalists, the latter covering large geographical regions as an added bonus. We then designed a cloud-based processing pipeline in which the data could be continually cleansed and enriched. Using filtering and scoring algorithms, we reduced the number of companies to a manageable size, helping M&A agents focus on the important targets.
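To make the filtering and scoring step more concrete, here is a minimal sketch in Python. It is not the production algorithm: the `Company` fields, the weights, the revenue cap, and the `min_score` threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Company:
    """Hypothetical record shape; the real schema is not described in this article."""
    name: str
    annual_revenue: Optional[float]  # None when no source reported a revenue
    country: str
    source_count: int  # number of independent sources that confirmed the record


def score(company: Company, target_countries: Set[str]) -> float:
    """Return a relevance score between 0 and 1 (weights are illustrative assumptions)."""
    total = 0.0
    if company.annual_revenue is not None:
        # Revenue contributes up to 0.5, capped at an assumed 10M reference value.
        total += min(company.annual_revenue / 10_000_000, 1.0) * 0.5
    if company.country in target_countries:
        # Being located in a target region contributes 0.3.
        total += 0.3
    # Confirmation by several sources contributes up to 0.2 (capped at 4 sources).
    total += min(company.source_count, 4) / 4 * 0.2
    return total


def shortlist(
    companies: List[Company],
    target_countries: Set[str],
    min_score: float = 0.5,
    limit: int = 100,
) -> List[Company]:
    """Drop low-scoring companies and return the best candidates first."""
    scored = [(score(c, target_countries), c) for c in companies]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [company for _, company in kept[:limit]]
```

In a sketch like this, the threshold and the limit are the two knobs that control how many candidates reach the manual validation step.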
For our customer, we decided to employ a set of AWS services. Lambda functions continuously retrieved fresh data from the selected company sources and also processed that data.
Step Functions, meanwhile, orchestrated the data-processing workflow and ingested the resulting data into DynamoDB as a persistence layer. EventBridge, SNS, and CloudWatch wired the services together and monitored them for failures.
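As a rough illustration of how such a workflow can be wired, the sketch below builds a Step Functions state machine definition in Amazon States Language as a Python dictionary. The state names, ARNs, table name, and item attributes are hypothetical placeholders, not the ones used in the project; only the overall shape (Lambda tasks, a DynamoDB write, and an SNS notification on failure) follows the description above.

```python
import json

# Hypothetical placeholders; the real function, table, and topic names are not given here.
FETCH_LAMBDA_ARN = "arn:aws:lambda:eu-central-1:123456789012:function:fetch-company-data"
ENRICH_LAMBDA_ARN = "arn:aws:lambda:eu-central-1:123456789012:function:enrich-company-data"
FAILURE_TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:pipeline-failures"
TABLE_NAME = "companies"


def build_definition() -> dict:
    """Return an Amazon States Language definition for a fetch, enrich, store workflow."""
    # Any failure in a step is routed to the notification state.
    catch_all = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]
    return {
        "Comment": "Illustrative company-data pipeline (assumed structure)",
        "StartAt": "FetchCompanyData",
        "States": {
            "FetchCompanyData": {
                "Type": "Task",
                "Resource": FETCH_LAMBDA_ARN,
                # Retry transient failures with exponential backoff before giving up.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                "Catch": catch_all,
                "Next": "EnrichCompanyData",
            },
            "EnrichCompanyData": {
                "Type": "Task",
                "Resource": ENRICH_LAMBDA_ARN,
                "Catch": catch_all,
                "Next": "StoreCompany",
            },
            "StoreCompany": {
                # Direct service integration: writes the item without another Lambda.
                "Type": "Task",
                "Resource": "arn:aws:states:::dynamodb:putItem",
                "Parameters": {
                    "TableName": TABLE_NAME,
                    "Item": {
                        "company_id": {"S.$": "$.company_id"},
                        "name": {"S.$": "$.name"},
                    },
                },
                "Catch": catch_all,
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": FAILURE_TOPIC_ARN,
                    "Message": "Company data pipeline failed",
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

The printed JSON could be supplied as the definition when creating a state machine. Routing every `Catch` to a single notification state is one simple way to surface failures; the actual monitoring setup with EventBridge and CloudWatch may well have looked different.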
The end result was exactly what we were looking for: a web application, natively built with AWS services, that gave our customer the required interface. Today this customer can score and filter company data right down to the most relevant details.