Author
Aditya Sundar - Waseda University
Abstract
This project develops a web crawler designed to systematically explore websites and generate comprehensive URL lists, regardless of sitemap availability. The crawler handles both static and dynamic content, tracks external references, and operates efficiently on limited hardware.
Key Capabilities:
- Asynchronous crawling with configurable concurrency
- BFS-based URL discovery for systematic coverage
- External reference tracking and analysis
- Dynamic content support via Playwright
- SQLite-based persistent storage
1. Introduction
Web crawling enables automated systems to browse websites and gather data. While sitemaps provide structured URL lists, not all websites have them. This crawler:
- Works on sites with or without sitemaps
- Generates external sitemaps (URLs referencing other domains)
- Runs on low-spec hardware without proxies
- Handles large-scale websites efficiently
2. Methodology
2.1 Technology Stack
| Technology | Purpose |
|---|---|
| Python | Primary language |
| aiohttp | Asynchronous HTTP requests |
| BeautifulSoup | HTML parsing and link extraction |
| asyncio | Concurrent code execution |
| SQLite | Persistent URL storage |
| Playwright | Dynamic content rendering |
| fake_useragent | Bot detection avoidance |
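For orientation, a minimal set of imports covering this stack might look as follows (package names assume the standard PyPI distributions aiohttp, beautifulsoup4, playwright, and fake-useragent; sqlite3 and asyncio ship with Python):

import asyncio
import sqlite3

import aiohttp
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from playwright.async_api import async_playwright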
2.2 Crawler Workflow
- Initialization: Normalize start URL, check robots.txt, create database tables
- URL Fetching: Asynchronous batch requests with delays
- Link Extraction: Parse anchor tags, normalize URLs
- Data Storage: SQLite for URLs, errors, external domains
- Concurrency: Configurable limit (default: 100 concurrent connections)
- Completion: Stop when no new URLs or limit reached
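As a rough illustration of the initialization and storage steps, the sketch below creates the kind of tables the workflow describes; the table and column names are illustrative assumptions, not the project's actual schema:

import sqlite3

def init_db(path="crawler.db"):
    # Illustrative schema: one table per data category named in the workflow
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS urls (
            url TEXT PRIMARY KEY,           -- normalized URL; also blocks duplicates
            depth INTEGER,
            crawled INTEGER DEFAULT 0
        );
        CREATE TABLE IF NOT EXISTS errors (
            url TEXT, status INTEGER, message TEXT
        );
        CREATE TABLE IF NOT EXISTS external_domains (
            domain TEXT PRIMARY KEY, reference_count INTEGER DEFAULT 0
        );
    """)
    conn.commit()
    return conn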
2.3 Robots.txt Compliance
import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def check_robots_txt(url, timeout=5, retry_count=3):
    parsed_url = urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    for attempt in range(retry_count):
        try:
            response = requests.get(robots_url, timeout=timeout)
            if response.status_code < 400:
                rp.parse(response.text.splitlines())
                return rp
        except requests.Timeout:
            print("Timeout fetching robots.txt, retrying...")
    # Default to allowing the crawl if robots.txt is unavailable
    rp.parse(['User-agent: *', 'Disallow:'])
    return rp
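Assuming the function returns the RobotFileParser as sketched above, each candidate URL can be screened before fetching (the URLs here are placeholders):

rp = check_robots_txt("https://example.com/")
if rp.can_fetch("*", "https://example.com/some/page"):
    pass  # permitted by robots.txt, safe to fetch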
2.4 BFS Algorithm
The crawler uses Breadth-First Search to explore URLs:
- Processes all URLs at current depth before moving deeper
- Ensures systematic coverage level by level
- Allows resumption if interrupted
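A minimal sketch of the BFS loop follows. It is sequential for clarity, whereas the actual crawler batches requests concurrently, and its error handling is deliberately simplified:

import asyncio
from collections import deque
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup

async def bfs_crawl(start_url, max_depth=3):
    visited = {start_url}
    frontier = deque([(start_url, 0)])        # FIFO queue of (url, depth)
    async with aiohttp.ClientSession() as session:
        while frontier:
            url, depth = frontier.popleft()   # FIFO order => level-by-level coverage
            try:
                async with session.get(url) as resp:
                    html = await resp.text()
            except aiohttp.ClientError:
                continue                      # real error handling is described in 2.5
            if depth + 1 > max_depth:
                continue                      # crawl the page but do not expand further
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]  # resolve and drop fragment
                if link not in visited:
                    visited.add(link)
                    frontier.append((link, depth + 1))
    return visited

# e.g. urls = asyncio.run(bfs_crawl("https://example.com"))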
2.5 Error Handling
| Error Type | Action |
|---|---|
| 404, 403, 400 | Skip (no retry) |
| 429 | Wait for Retry-After header |
| 500, 502, 503 | Retry with delay |
| Timeout | Retry up to 3 times |
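The policy in the table could be expressed roughly as follows, given an aiohttp ClientSession. This is a sketch rather than the project's actual handler; the status-code groupings follow the table:

import asyncio

SKIP_STATUSES = {400, 403, 404}
RETRY_STATUSES = {500, 502, 503}

async def fetch_with_policy(session, url, max_retries=3, delay=2):
    for attempt in range(max_retries):
        try:
            async with session.get(url) as resp:
                if resp.status in SKIP_STATUSES:
                    return None                        # skip, no retry
                if resp.status == 429:
                    # assumes Retry-After carries a value in seconds
                    wait = int(resp.headers.get("Retry-After", delay))
                    await asyncio.sleep(wait)
                    continue
                if resp.status in RETRY_STATUSES:
                    await asyncio.sleep(delay)         # transient server error, retry
                    continue
                return await resp.text()
        except asyncio.TimeoutError:
            await asyncio.sleep(delay)                 # retry timeouts up to max_retries
    return None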
3. Results
3.1 Crawling Statistics
| Website Type | Internal URLs | External References |
|---|---|---|
| Entertainment Site 1 | 47,870 | 7,010,999 |
| Gaming Site 2 | 143,058 | 3,027,460 |
| Press Release Site | 99,243 | 2,136,887 |
| News Site 6 | 117,164 | 1,895,515 |
| Government Site 1 | 28,841 | 55,700 |
| Education Site 1 | 50,458 | 28,328 |
Totals across all crawled websites:
- Internal URLs: 1,121,407
- External References: 18,957,040
Figure: External-to-Internal URL Ratio by Website Type
Entertainment/news sites have much higher external reference ratios than government sites, reflecting different content strategies.
3.2 Error Analysis
| Error Type | Count |
|---|---|
| Status 404 | 43,611 |
| Status 503 | 29,555 |
| Decoding errors | 3,271 |
| Robots.txt restrictions | 2,696 |
| Other errors | 124,261 |
| Max retry reached | 11,994 |
Links ultimately skipped: 62,050
3.3 SEO Insights
- Media sites: High external reference ratio may increase credibility
- Government sites: Focus on internal linking is appropriate
- Balanced approach recommended for most sites
4. Optimizations
Rate Limiting
connector = aiohttp.TCPConnector(limit=100) # Adjustable
Lower limits reduce the risk of 429 (Too Many Requests) responses but slow crawling.
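One way to combine the connection cap with per-request pacing is sketched below; the semaphore size and delay are illustrative values, not the project's settings:

import asyncio
import aiohttp

async def fetch(session, sem, url, delay=0.5):
    async with sem:                        # cap the number of in-flight requests
        async with session.get(url) as resp:
            await asyncio.sleep(delay)     # polite delay between requests
            return await resp.text()

async def crawl(urls, limit=100):
    sem = asyncio.Semaphore(limit)
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = (fetch(session, sem, u) for u in urls)
        return await asyncio.gather(*tasks, return_exceptions=True)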
Dynamic Content (SPAs)
Playwright handles JavaScript-rendered content:
from playwright.async_api import async_playwright

async def scrape_all_links(start_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(start_url)
        # Collect hrefs from all anchors after JavaScript rendering
        links = await page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        await browser.close()
        return links
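A minimal way to drive this from a synchronous entry point (the URL is a placeholder):

import asyncio

links = asyncio.run(scrape_all_links("https://example.com"))

Running with headless=True generally consumes fewer resources, which matters on low-spec hardware; headless=False is useful mainly for debugging the rendered page.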
URL Normalization
- Remove fragments
- Unify schemes (HTTPS only)
- Remove trailing slashes
- Database prevents duplicate crawling
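A sketch of these rules with urllib, as an illustration rather than the project's exact implementation:

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)  # drop the fragment
    return urlunsplit(("https",            # unify schemes to HTTPS
                       netloc.lower(),
                       path.rstrip("/"),   # remove trailing slashes
                       query,
                       ""))

Storing the normalized form as a PRIMARY KEY (or UNIQUE column) in SQLite is one way the database can reject duplicate URLs automatically.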
5. Future Directions
- Proxy support: Reduce IP blocking risk
- Better hardware: Handle larger crawls
- Scrapy integration: Leverage robust framework features
- Graph analysis: Map website relationships using embeddings
References
- aiohttp - Asynchronous HTTP client/server for Python
- BeautifulSoup - HTML/XML parsing library
- Playwright - Browser automation framework
- SQLite - Lightweight database engine