Personal project · 2025
Wikipedia Scraper
Async crawler with 100 concurrent workers, O(1) URL deduplication, and a 20-second global deadline.
Python · Asyncio · Aiohttp · BeautifulSoup
Challenge
Crawling a large site efficiently under a strict time limit means balancing throughput against resource use.
Approach
Built a high-concurrency crawler on Python's asyncio and aiohttp: 100 worker coroutines run on a single event loop, a hash set deduplicates URLs in average O(1) time, and a global 20-second deadline cancels whatever work is still pending.
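A minimal sketch of how such a worker pool and deadline could fit together, using only the standard library. The names `crawl` and `fetch_links` are hypothetical, not taken from the project; `fetch_links` stands in for whatever coroutine downloads a page (for example with aiohttp) and returns the URLs found on it (for example with BeautifulSoup).

```python
import asyncio


async def crawl(seeds, fetch_links, workers=100, deadline_s=20.0):
    """Crawl outward from `seeds` until the frontier drains or the deadline passes.

    `fetch_links(url)` is an assumed coroutine that returns an iterable of URLs.
    Returns the set of every URL that was discovered.
    """
    queue = asyncio.Queue()
    seen = set(seeds)  # visited/queued URLs; membership checks are average O(1)
    for url in seen:
        queue.put_nowait(url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
            except Exception:
                pass  # one failed page should not stop the worker
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    try:
        # queue.join() completes once every queued URL has been processed;
        # wait_for() gives up when the global deadline is reached.
        await asyncio.wait_for(queue.join(), timeout=deadline_s)
    except asyncio.TimeoutError:
        pass  # deadline hit: fall through and cancel the workers
    finally:
        for task in tasks:
            task.cancel()
        # Cancellation is delivered at each worker's next await point.
        await asyncio.gather(*tasks, return_exceptions=True)
    return seen
```

Passing the fetcher in as a parameter keeps the scheduling logic independent of the HTTP client, so the deadline behaviour can be exercised with a fake fetcher and no network access.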
What it does
- 100 Concurrent Workers. Keeps many requests in flight at once, so network I/O latency overlaps instead of accumulating.
- 20s Deadline Enforcement. A global deadline propagates to every worker; when the limit is reached, all pending tasks are cancelled at their next await point.
- URL Deduplication. A hash set gives average O(1) lookups, preventing redundant processing and infinite loops on cyclic links.
- Non-Blocking Architecture. Full async event loop with robust link normalization and protocol handling.
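The deduplication and link normalization points can be illustrated with a short standard-library sketch. The helper names `normalize` and `new_links` are hypothetical, and the specific rules (drop fragments, lowercase the host, keep only http and https) are assumptions about what normalization might involve, not a description of the project's actual rules.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize(base, href):
    """Resolve `href` against `base`; return a canonical URL, or None if not http(s)."""
    # urljoin handles relative and protocol-relative links; urldefrag drops "#...".
    absolute, _fragment = urldefrag(urljoin(base, href))
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar schemes
    # Host names are case-insensitive, so lowercase them; an empty path becomes "/".
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def new_links(base, hrefs, seen):
    """Return the normalized links not yet in `seen`, adding them to `seen`."""
    fresh = []
    for href in hrefs:
        url = normalize(base, href)
        if url is not None and url not in seen:  # set lookup: average O(1)
            seen.add(url)
            fresh.append(url)
    return fresh
```

Normalizing before the set lookup matters: without it, `/wiki/Asyncio` and `/wiki/Asyncio#History` would count as two different pages and the same content would be fetched twice.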