Personal project · 2025

Wikipedia Scraper

Async crawler with 100 concurrent workers, O(1) URL deduplication, and a 20-second global deadline.

Python · Asyncio · Aiohttp · BeautifulSoup

Challenge

Crawling a large site such as Wikipedia under a strict time limit means balancing throughput against resource use: too few concurrent requests waste the time budget, while too many strain connections and memory.

Built a high-concurrency async crawler with 100 workers, O(1) URL deduplication, and a global 20-second deadline using Python's asyncio and aiohttp.
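The shape of such a crawler can be sketched as a fixed pool of worker tasks draining a shared queue, with the whole run bounded by one timeout. This is a minimal sketch, not the project's actual code: the names crawl and fetch_links are hypothetical, and fetch_links stands in for whatever coroutine downloads a page (for example with aiohttp) and extracts its links (for example with BeautifulSoup).

```python
import asyncio


async def crawl(start, fetch_links, workers=100, deadline=20.0):
    """Crawl from `start` until the frontier empties or `deadline` seconds pass.

    `fetch_links` is a hypothetical coroutine: url -> iterable of outgoing URLs.
    Returns the set of URLs discovered.
    """
    seen = {start}              # set membership is O(1) on average
    queue = asyncio.Queue()
    queue.put_nowait(start)

    async def worker():
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
            except asyncio.CancelledError:
                raise           # let cancellation propagate to stop the worker
            except Exception:
                pass            # one failed page should not kill the worker
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    try:
        # queue.join() returns once every queued URL has been processed;
        # wait_for cancels the wait if the global deadline expires first.
        await asyncio.wait_for(queue.join(), timeout=deadline)
    except asyncio.TimeoutError:
        pass                    # deadline hit: fall through to cleanup
    finally:
        for task in tasks:
            task.cancel()
        # Wait for the workers to actually finish unwinding.
        await asyncio.gather(*tasks, return_exceptions=True)
    return seen
```

Passing the fetcher in as a parameter keeps the scheduling logic testable without network access; a real fetcher would share one aiohttp session across all workers.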

100 Concurrent Workers. Keeps up to 100 requests in flight on a single event loop, so time spent waiting on one response overlaps with work on the others.
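20s Deadline Enforcement. A single global deadline cancels all pending tasks once the 20-second limit is reached. Because asyncio delivers cancellation at a task's next await point, shutdown follows the limit closely rather than to the instant.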
URL Deduplication. A hash set gives average-case O(1) membership checks, preventing redundant fetches and infinite loops on cyclic links.
Non-Blocking Architecture. Full async event loop with robust link normalization and protocol handling.
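Deduplication only works if equivalent links map to the same string, which is why normalization comes first. The sketch below shows one plausible approach using only the standard library. The specific rules (resolve relative and protocol-relative links, drop fragments, keep only http and https, lowercase the host) are assumptions, since the project description does not list them, and the function names normalize and unseen_links are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize(base_url, href):
    """Resolve `href` against `base_url`; return a canonical URL or None."""
    # urljoin handles relative ("/wiki/X") and protocol-relative ("//host/X")
    # links; urldefrag drops the "#fragment", which never changes the page.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar schemes
    # Assumed rule: lowercase the network location (this also lowercases any
    # userinfo, which is acceptable for a sketch) and default the path to "/".
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def unseen_links(base_url, hrefs, seen):
    """Return normalized links not yet in `seen`, recording them as seen."""
    fresh = []
    for href in hrefs:
        url = normalize(base_url, href)
        if url is not None and url not in seen:  # average-case O(1) lookup
            seen.add(url)
            fresh.append(url)
    return fresh
```

With this in place, links such as /wiki/Asyncio#History and //EN.wikipedia.org/wiki/Asyncio found on the same page collapse to a single queue entry.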