Design a web crawler that fetches every page on en.wikipedia.org exactly 1 time. You have 10,000 servers you can use and you are not allowed to fetch a URL more than once. If a URL fails to be fetched (because of a timeout or server failure), it can be discarded.
Requirements
Functional
- Download all URLs (about 6.7 million pages) from en.wikipedia.org
- Only download each URL once
- Using 10k 2-core servers
- The only processing done on the content is URL extraction; otherwise, persist the content to local storage
- Don’t crawl images
- Only crawl English Wikipedia
- Minimize network traffic between servers.
Non-functional
- No need to self-rate limit
- Fetch the content as fast as possible
High-level Analysis
How much storage will we need to store 6,700,000 URLs and their HTML documents?
The average internet URL length is 66 characters. Since we don’t need to track the domain name or HTTPS prefix, we will round down to 60 characters.
60 characters = 60 bytes
60 bytes * 6,700,000 URLs ≈ 402,000,000 bytes, or roughly 400 megabytes
The average HTML page is about 100kb.
100 kilobytes * 6,700,000 documents = 670 Gigabytes
(from: https://leetcode.com/discuss/interview-question/723588/designing-a-distributed-web-crawler)
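As a sanity check, these back-of-the-envelope numbers can be reproduced with a few lines of Python; the 6.7 million page count, 60-byte URL, and 100 KB average page size are simply the estimates stated above:

```python
# Back-of-the-envelope storage estimate for URLs and page content.
NUM_PAGES = 6_700_000       # approximate English Wikipedia page count (estimate above)
AVG_URL_BYTES = 60          # average URL length after dropping scheme and domain
AVG_PAGE_BYTES = 100_000    # ~100 KB average HTML document (estimate above)

url_storage_gb = NUM_PAGES * AVG_URL_BYTES / 1e9
content_storage_gb = NUM_PAGES * AVG_PAGE_BYTES / 1e9

print(f"URL storage:     ~{url_storage_gb:.1f} GB")      # ~0.4 GB
print(f"Content storage: ~{content_storage_gb:.0f} GB")  # ~670 GB
```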
URL Frontier
A URL Frontier acts as a queue system, effectively deciding the order in which pages are crawled.
Use a centralized database or a distributed system like Apache Cassandra to maintain the list of URLs to be crawled. This database should be designed to handle high read/write throughput. Implement a mechanism to ensure each URL is unique in the frontier; this can be achieved with a hash table, a set, or a Bloom filter.
Here are the key components and considerations in designing a URL Frontier:
- Queue Management:
  - Multiple queues can be used to prioritize URLs based on certain criteria (like domain, page importance, or update frequency).
  - Priority queue algorithms can ensure that more important URLs are crawled first.
- URL Deduplication:
  - Use a distributed cache or a Bloom filter to store visited URLs efficiently, and check against it before crawling a URL (see the sketch after this list).
  - Can be implemented using hash tables, Bloom filters, or other data structures that efficiently check whether a URL has already been added.
- URL Normalization:
  - Standardizes URLs to a consistent format, ensuring that different forms of the same URL are recognized as identical.
  - Includes converting to lowercase, removing “www”, and handling redirects.
- Politeness Policy:
  - Ensures the crawler respects site-specific constraints, like the crawl-delay defined in robots.txt.
  - Helps maintain a good relationship with web servers and avoid IP bans.
- Robots.txt Compliance:
  - The Frontier must check robots.txt files and filter out URLs disallowed by them.
  - This requires fetching and parsing robots.txt for each domain.
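A minimal sketch of the normalization and deduplication steps described above. It uses an in-memory set for clarity; in the real system this would be a distributed cache or a Bloom filter shared across servers, and the specific normalization rules (lowercasing, dropping the query string and fragment) are illustrative assumptions:

```python
from urllib.parse import urlparse, urlunparse

seen = set()  # stand-in for a distributed cache or Bloom filter

def normalize(url: str) -> str:
    """Bring different spellings of the same URL to one canonical form."""
    parts = urlparse(url)
    netloc = parts.netloc.lower().removeprefix("www.")
    path = parts.path or "/"
    # Drop query string and fragment; article URLs don't need them.
    return urlunparse(("https", netloc, path, "", "", ""))

def should_enqueue(url: str) -> bool:
    """Return True only the first time a normalized URL is seen."""
    canonical = normalize(url)
    if canonical in seen:
        return False
    seen.add(canonical)
    return True

print(should_enqueue("https://EN.wikipedia.org/wiki/Web_crawler#History"))  # True
print(should_enqueue("https://en.wikipedia.org/wiki/Web_crawler"))          # False
```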
DNS Resolver
A DNS resolver is an important component of a web crawler, because the crawler performs frequent DNS lookups. DNS (Domain Name System) resolution is the process of translating a domain name (like www.example.com) into an IP address that can be used to make network requests.
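Because every request in this design targets the same domain, a small cache in front of the standard resolver avoids repeating the lookup on every fetch. A sketch using only the standard library (the cache policy is a simplification: a production cache would also honor DNS TTLs):

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=1024)
def resolve(hostname: str) -> str:
    """Resolve a hostname once and reuse the answer for later requests."""
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address string.
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]

print(resolve("en.wikipedia.org"))  # cached after the first call
```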
Crawler Engine (Coordinator)
The crawler coordinator works as a scheduler: it fetches URLs from the URL Frontier and arranges URL-visit tasks for the worker nodes. It puts the visit tasks onto a distributed message queue so that worker nodes can pull URLs and carry out the fetches.
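A sketch of that coordinator loop: it drains batches of unvisited URLs from the frontier and publishes them to a topic that the worker nodes consume. The `frontier` and `queue` objects, and their `take_batch`, `publish`, and `mark_scheduled` methods, are hypothetical placeholders for whatever store and broker are chosen:

```python
import time

def run_coordinator(frontier, queue, batch_size=1000):
    """Move ready URLs from the frontier into the work queue for workers."""
    while True:
        batch = frontier.take_batch(batch_size)   # hypothetical frontier API
        if not batch:
            time.sleep(1)                         # frontier temporarily empty
            continue
        for url in batch:
            queue.publish("urls-to-crawl", url)   # hypothetical broker API
        frontier.mark_scheduled(batch)            # never hand out the same URL twice
```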
Message Queue
Implement a load balancer or a distributed queue system (like Kafka) to distribute URLs evenly among crawler worker nodes. Each node independently pulls URL requests from the message queue.
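With Kafka as the broker, the coordinator side maps to a producer and each worker to a consumer in a shared consumer group, so partitions (and therefore URLs) are spread across the worker fleet. A sketch using the kafka-python client; the topic name, broker address, and group id are assumptions:

```python
from kafka import KafkaConsumer, KafkaProducer

# Coordinator side: publish URLs onto the shared topic.
producer = KafkaProducer(bootstrap_servers="kafka:9092")
producer.send("urls-to-crawl", b"https://en.wikipedia.org/wiki/Web_crawler")
producer.flush()

# Worker side: every worker joins the same consumer group, so Kafka assigns
# each partition (and its URLs) to exactly one worker at a time.
consumer = KafkaConsumer(
    "urls-to-crawl",
    bootstrap_servers="kafka:9092",
    group_id="crawler-workers",
    value_deserializer=lambda raw: raw.decode("utf-8"),
)
for message in consumer:
    url = message.value
    # fetch(url), extract links, persist content ...
```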
Crawler Instances (Worker Nodes)
Deploy multiple crawler nodes (servers) that can operate independently. Each node is responsible for fetching web pages, extracting links, and passing new URLs back to a central system.
Since the crawl is restricted to “en.wikipedia.org”, configure the crawler to ignore links that lead outside this domain. Each node should use concurrent requests to maximize efficiency, but also implement rate limiting to comply with Wikipedia’s policies and avoid overloading their servers. Ensure the crawler respects the directives in Wikipedia’s robots.txt file, and adhere to legal and ethical norms, especially regarding data usage and storage.
If a URL fails to be fetched due to timeout or server failure, mark it as ‘failed’ in the database and do not attempt to refetch. Implement a heartbeat system where each node regularly updates its status. If a node fails, its queue of URLs can be returned to the central database for redistribution.
Each node should parse the HTML content of fetched pages to extract new URLs.
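A sketch of that fetch-and-extract step on a worker node, using requests and BeautifulSoup. Keeping only links containing “/wiki/” is an illustrative heuristic for article links; the real in-domain check happens in the URL filter described next:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_and_extract(url: str, timeout: float = 10.0):
    """Download one page and return (html, outgoing article links)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()                   # treat HTTP errors as fetch failures
    html = response.text

    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])   # resolve relative links
        if "/wiki/" in absolute:                  # crude article-link heuristic
            links.append(absolute)
    return html, links
```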
URL Filter
The URL filter receives the URLs parsed during content fetching, filters out any that lead outside of “en.wikipedia.org” or duplicate already discovered URLs, and saves the remaining URLs to the URL store.
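A sketch of the filter itself. The duplicate check is shown with an in-memory set for brevity; in this design it would be backed by the same distributed cache or Bloom filter the frontier uses:

```python
from urllib.parse import urlparse

ALLOWED_HOST = "en.wikipedia.org"
discovered = set()   # in production: distributed cache or Bloom filter

def filter_urls(candidate_urls):
    """Keep only in-domain, never-before-seen URLs."""
    accepted = []
    for url in candidate_urls:
        host = urlparse(url).netloc.lower()
        if host != ALLOWED_HOST:
            continue                  # leads outside en.wikipedia.org
        if url in discovered:
            continue                  # already queued or crawled
        discovered.add(url)
        accepted.append(url)
    return accepted
```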
Content & Metadata Storage
Store the fetched content in a distributed file system or a big data storage solution like Hadoop HDFS or Amazon S3. Store metadata (like URL, fetch status, and timestamp) in a distributed database for easy access and management. Check the metadata store before saving content to the content store to avoid duplication (a minimal sketch follows the list below).
- Employ a distributed database system to handle large volumes of data.
- Use database sharding to distribute data across multiple servers.
- Implement efficient indexing strategies for quick retrieval of stored data.
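A sketch of the persistence step, assuming S3-compatible object storage via boto3 for the content and a hypothetical `metadata_store` client for the metadata table; the bucket name and key scheme are illustrative:

```python
import hashlib
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "wikipedia-crawl-content"   # illustrative bucket name

def persist(url: str, html: str, metadata_store):
    """Write page content to object storage and record metadata, once per URL."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()

    # Metadata check first, so the same document is never stored twice.
    if metadata_store.exists(key):            # hypothetical metadata-store API
        return

    s3.put_object(Bucket=BUCKET, Key=f"pages/{key}.html",
                  Body=html.encode("utf-8"), ContentType="text/html")
    metadata_store.put(key, {                 # hypothetical metadata-store API
        "url": url,
        "fetch_status": "success",
        "timestamp": int(time.time()),
    })
```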
Monitoring and Logging
- Monitoring System: Set up a monitoring system to track the progress, performance metrics, and error rates of the crawl.
- Logging: Log detailed information about the crawling process for debugging and analysis purposes.
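A minimal logging setup for the workers using only the standard library; the counter names and log format are illustrative, and in practice these metrics would be exported to whatever monitoring system is chosen:

```python
import logging
from collections import Counter

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
)
log = logging.getLogger("crawler.worker")
stats = Counter()   # crude per-process metrics: pages fetched, errors, bytes

def record_fetch(url: str, status_code: int, num_bytes: int):
    """Update counters and emit one log line per fetch attempt."""
    stats["fetched"] += 1
    stats["bytes"] += num_bytes
    if status_code >= 400:
        stats["errors"] += 1
        log.warning("fetch failed url=%s status=%d", url, status_code)
    else:
        log.info("fetched url=%s status=%d bytes=%d", url, status_code, num_bytes)
```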
Ethical Considerations
- Responsible Crawling: Always prioritize the stability and performance of Wikipedia’s servers over the crawler’s data collection speed.
- Transparency: Be transparent about the crawler’s activities and purpose, especially if the data collected will be used for public or commercial purposes.
- Compliance with Legal and Ethical Standards: Ensure that the crawling activities comply with legal requirements and ethical standards, respecting user privacy and copyright laws.
robots.txt File
Here is a simple example of what a robots.txt file might look like:
User-agent: *
Disallow: /private/
Disallow: /temp/
Allow: /
Sitemap: http://www.example.com/sitemap.xml
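The standard library’s urllib.robotparser can consume a file like the one above and answer per-URL questions before a fetch is scheduled:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()   # fetches and parses the live robots.txt

# Ask whether our user agent may fetch a given path.
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Web_crawler"))

# Crawl-delay for this user agent, or None if the file doesn't specify one.
print(rp.crawl_delay("*"))
```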
Handling Retries for Failed/Errored URLs
- Retry Logic:
  - Implement a retry mechanism for transient errors (like network issues or server errors). The number of retries can be limited to avoid endless loops.
- Exponential Backoff:
  - Use an exponential backoff strategy for retries, gradually increasing the wait time between retries to reduce load on the server and network (see the sketch after this list).
- Error Categorization:
  - Categorize errors to decide which ones are worth retrying. For example, a 404 Not Found might not be retried, but a 503 Service Unavailable might be.
- Logging and Monitoring:
  - Log failures and monitor error rates to identify and address systemic issues.
- Feedback Loop:
  - Use feedback from the Content Fetcher to update the URL Frontier about the status of each URL, ensuring that the system is aware of which URLs need to be retried.
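A sketch tying these points together: transient HTTP errors are retried with exponential backoff up to a small cap, while permanent errors fail immediately and are reported back to the frontier. The specific status codes, retry count, and base delay are assumptions:

```python
import time
from typing import Optional

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}   # transient errors worth retrying
MAX_RETRIES = 3
BASE_DELAY_SECONDS = 1.0

def fetch_with_retries(url: str) -> Optional[str]:
    """Return the page HTML, or None if the URL should be marked as failed."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            pass                                    # network error: retry below
        else:
            if response.status_code == 200:
                return response.text
            if response.status_code not in RETRYABLE_STATUS:
                return None                         # e.g. 404: not worth retrying
        time.sleep(BASE_DELAY_SECONDS * (2 ** attempt))  # exponential backoff
    return None                                     # give up; report back to the frontier
```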