Author
Aditya Sundar - Waseda University
Abstract
This project develops a web crawler designed to systematically explore websites and generate comprehensive URL lists, regardless of sitemap availability. The crawler handles both static and dynamic content, tracks external references, and operates efficiently on limited hardware.
Key Capabilities:
- Asynchronous crawling with configurable concurrency
- BFS-based URL discovery for systematic coverage
- External reference tracking and analysis
- Dynamic content support via Playwright
- SQLite-based persistent storage
1. Introduction
Web crawling enables automated systems to browse websites and gather data. While sitemaps provide structured URL lists, not all websites have them. This crawler:
- Works on sites with or without sitemaps
- Generates external sitemaps (URLs referencing other domains)
- Runs on low-spec hardware without proxies
- Handles large-scale websites efficiently
2. Methodology
2.1 Technology Stack
| Technology | Purpose |
|---|---|
| Python | Primary language |
| aiohttp | Asynchronous HTTP requests |
| BeautifulSoup | HTML parsing and link extraction |
| asyncio | Concurrent code execution |
| SQLite | Persistent URL storage |
| Playwright | Dynamic content rendering |
| fake_useragent | Bot detection avoidance |
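For orientation, a minimal set of imports covering this stack might look as follows (package names assume the standard PyPI distributions aiohttp, beautifulsoup4, playwright, and fake-useragent; sqlite3 and asyncio ship with Python):

import asyncio
import sqlite3

import aiohttp
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from playwright.async_api import async_playwright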
2.2 Crawler Workflow
- Initialization: Normalize start URL, check robots.txt, create database tables
- URL Fetching: Asynchronous batch requests with delays
- Link Extraction: Parse anchor tags, normalize URLs
- Data Storage: SQLite for URLs, errors, external domains
- Concurrency: Configurable limit (default: 100 concurrent connections)
- Completion: Stop when no new URLs or limit reached
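As a rough illustration of the initialization and storage steps, the sketch below creates the kind of tables the workflow describes; the table and column names are illustrative assumptions, not the project's actual schema:

import sqlite3

def init_db(path="crawler.db"):
    # Illustrative schema: one table per data category named in the workflow
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS urls (
            url TEXT PRIMARY KEY,           -- normalized URL; also blocks duplicates
            depth INTEGER,
            crawled INTEGER DEFAULT 0
        );
        CREATE TABLE IF NOT EXISTS errors (
            url TEXT, status INTEGER, message TEXT
        );
        CREATE TABLE IF NOT EXISTS external_domains (
            domain TEXT PRIMARY KEY, reference_count INTEGER DEFAULT 0
        );
    """)
    conn.commit()
    return conn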
2.3 Robots.txt Compliance
import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def check_robots_txt(url, timeout=5, retry_count=3):
    parsed_url = urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    for attempt in range(retry_count):
        try:
            response = requests.get(robots_url, timeout=timeout)
            if response.status_code < 400:
                rp.parse(response.text.splitlines())
                return rp
        except requests.Timeout:
            print("Timeout fetching robots.txt, retrying...")
    # Default to allowing the crawl if robots.txt is unavailable
    rp.parse(['User-agent: *', 'Disallow:'])
    return rp
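Assuming the function returns the RobotFileParser as sketched above, each candidate URL can be screened before fetching (the URLs here are placeholders):

rp = check_robots_txt("https://example.com/")
if rp.can_fetch("*", "https://example.com/some/page"):
    pass  # permitted by robots.txt, safe to fetch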
2.4 BFS Algorithm
The crawler uses Breadth-First Search to explore URLs:
- Processes all URLs at current depth before moving deeper
- Ensures systematic coverage level by level
- Allows resumption if interrupted
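A minimal sketch of the BFS loop follows. It is sequential for clarity, whereas the actual crawler batches requests concurrently, and its error handling is deliberately simplified:

import asyncio
from collections import deque
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup

async def bfs_crawl(start_url, max_depth=3):
    visited = {start_url}
    frontier = deque([(start_url, 0)])        # FIFO queue of (url, depth)
    async with aiohttp.ClientSession() as session:
        while frontier:
            url, depth = frontier.popleft()   # FIFO order => level-by-level coverage
            try:
                async with session.get(url) as resp:
                    html = await resp.text()
            except aiohttp.ClientError:
                continue                      # real error handling is described in 2.5
            if depth + 1 > max_depth:
                continue                      # crawl the page but do not expand further
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]  # resolve and drop fragment
                if link not in visited:
                    visited.add(link)
                    frontier.append((link, depth + 1))
    return visited

# e.g. urls = asyncio.run(bfs_crawl("https://example.com"))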
2.5 Error Handling
| Error Type | Action |
|---|---|
| 404, 403, 400 | Skip (no retry) |
| 429 | Wait for Retry-After header |
| 500, 502, 503 | Retry with delay |
| Timeout | Retry up to 3 times |
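The policy in the table could be expressed roughly as follows, given an aiohttp ClientSession. This is a sketch rather than the project's actual handler; the status-code groupings follow the table:

import asyncio

SKIP_STATUSES = {400, 403, 404}
RETRY_STATUSES = {500, 502, 503}

async def fetch_with_policy(session, url, max_retries=3, delay=2):
    for attempt in range(max_retries):
        try:
            async with session.get(url) as resp:
                if resp.status in SKIP_STATUSES:
                    return None                        # skip, no retry
                if resp.status == 429:
                    # assumes Retry-After carries a value in seconds
                    wait = int(resp.headers.get("Retry-After", delay))
                    await asyncio.sleep(wait)
                    continue
                if resp.status in RETRY_STATUSES:
                    await asyncio.sleep(delay)         # transient server error, retry
                    continue
                return await resp.text()
        except asyncio.TimeoutError:
            await asyncio.sleep(delay)                 # retry timeouts up to max_retries
    return None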
3. Results
3.1 Crawling Statistics
| Website Type | Internal URLs | External References |
|---|---|---|
| Entertainment Site 1 | 47,870 | 7,010,999 |
| Gaming Site 2 | 143,058 | 3,027,460 |
| Press Release Site | 99,243 | 2,136,887 |
| News Site 6 | 117,164 | 1,895,515 |
| Government Site 1 | 28,841 | 55,700 |
| Education Site 1 | 50,458 | 28,328 |
Totals across all crawled websites:
- Internal URLs: 1,121,407
- External References: 18,957,040
Figure: External-to-Internal URL Ratio by Website Type
Entertainment/news sites have much higher external reference ratios than government sites, reflecting different content strategies.
3.2 Error Analysis
| Error Type | Count |
|---|---|
| Status 404 | 43,611 |
| Status 503 | 29,555 |
| Decoding errors | 3,271 |
| Robots.txt restrictions | 2,696 |
| Other errors | 124,261 |
| Max retry reached | 11,994 |
Links ultimately skipped: 62,050
3.3 SEO Insights
- Media sites: High external reference ratio may increase credibility
- Government sites: Focus on internal linking is appropriate
- Balanced approach recommended for most sites
4. Optimizations
Rate Limiting
connector = aiohttp.TCPConnector(limit=100) # Adjustable
Lower limits reduce the risk of 429 (Too Many Requests) responses but slow crawling.
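One way to combine the connection cap with per-request pacing is sketched below; the semaphore size and delay are illustrative values, not the project's settings:

import asyncio
import aiohttp

async def fetch(session, sem, url, delay=0.5):
    async with sem:                        # cap the number of in-flight requests
        async with session.get(url) as resp:
            await asyncio.sleep(delay)     # polite delay between requests
            return await resp.text()

async def crawl(urls, limit=100):
    sem = asyncio.Semaphore(limit)
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = (fetch(session, sem, u) for u in urls)
        return await asyncio.gather(*tasks, return_exceptions=True)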
Dynamic Content (SPAs)
Playwright handles JavaScript-rendered content:
from playwright.async_api import async_playwright

async def scrape_all_links(start_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(start_url)
        # Collect hrefs from all anchors after JavaScript rendering
        links = await page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        await browser.close()
        return links
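A minimal way to drive this from a synchronous entry point (the URL is a placeholder):

import asyncio

links = asyncio.run(scrape_all_links("https://example.com"))

Running with headless=True generally consumes fewer resources, which matters on low-spec hardware; headless=False is useful mainly for debugging the rendered page.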
URL Normalization
- Remove fragments
- Unify schemes (HTTPS only)
- Remove trailing slashes
- Database prevents duplicate crawling
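A sketch of these rules with urllib, as an illustration rather than the project's exact implementation:

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)  # drop the fragment
    return urlunsplit(("https",            # unify schemes to HTTPS
                       netloc.lower(),
                       path.rstrip("/"),   # remove trailing slashes
                       query,
                       ""))

Storing the normalized form as a PRIMARY KEY (or UNIQUE column) in SQLite is one way the database can reject duplicate URLs automatically.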
5. Future Directions
- Proxy support: Reduce IP blocking risk
- Better hardware: Handle larger crawls
- Scrapy integration: Leverage robust framework features
- Graph analysis: Map website relationships using embeddings
References
- aiohttp - Asynchronous HTTP client/server for Python
- BeautifulSoup - HTML/XML parsing library
- Playwright - Browser automation framework
- SQLite - Lightweight database engine