
Author

Aditya Sundar - Waseda University

Abstract

This project develops a web crawler that systematically explores websites and generates comprehensive URL lists, regardless of sitemap availability. The crawler handles both static and dynamic content, tracks external references, and operates efficiently on limited hardware.

Key Capabilities:
  • Asynchronous crawling with configurable concurrency
  • BFS-based URL discovery for systematic coverage
  • External reference tracking and analysis
  • Dynamic content support via Playwright
  • SQLite-based persistent storage

1. Introduction

Web crawling enables automated systems to browse websites and gather data. While sitemaps provide structured URL lists, not all websites have them. This crawler:
  • Works on sites with or without sitemaps
  • Generates external sitemaps (URLs referencing other domains)
  • Runs on low-spec hardware without proxies
  • Handles large-scale websites efficiently

2. Methodology

2.1 Technology Stack

Technology        Purpose
Python            Primary language
aiohttp           Asynchronous HTTP requests
BeautifulSoup     HTML parsing and link extraction
asyncio           Concurrent code execution
SQLite            Persistent URL storage
Playwright        Dynamic content rendering
fake_useragent    Bot detection avoidance

2.2 Crawler Workflow

  1. Initialization: Normalize start URL, check robots.txt, create database tables
  2. URL Fetching: Asynchronous batch requests with delays (see the sketch after this list)
  3. Link Extraction: Parse anchor tags, normalize URLs
  4. Data Storage: SQLite for URLs, errors, external domains
  5. Concurrency: Configurable limit (default: 100 concurrent)
  6. Completion: Stop when no new URLs or limit reached
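
The sketch below illustrates steps 2-4 under simplifying assumptions: the function names (fetch, extract_links, crawl) and the single-column urls table are illustrative placeholders, not the project's actual code.

import asyncio
import sqlite3
from urllib.parse import urljoin, urldefrag

import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # Step 2: fetch one page and return its HTML, or None on failure
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status == 200 and "text/html" in resp.headers.get("Content-Type", ""):
                return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None
    return None

def extract_links(base_url, html):
    # Step 3: parse anchor tags, resolve relative URLs, drop fragments
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        link, _ = urldefrag(urljoin(base_url, a["href"]))
        links.add(link)
    return links

async def crawl(start_url, max_urls=1000, batch_size=100):
    seen = {start_url}
    frontier = [start_url]                      # FIFO frontier gives BFS order
    db = sqlite3.connect("crawl.db")
    db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    async with aiohttp.ClientSession() as session:
        while frontier and len(seen) < max_urls:
            batch, frontier = frontier[:batch_size], frontier[batch_size:]
            pages = await asyncio.gather(*(fetch(session, u) for u in batch))
            for base, html in zip(batch, pages):
                if html is None:
                    continue
                db.execute("INSERT OR IGNORE INTO urls VALUES (?)", (base,))  # Step 4
                for link in extract_links(base, html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
            db.commit()
            await asyncio.sleep(1)              # polite delay between batches
    db.close()
    return seen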

2.3 Robots.txt Compliance

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

def check_robots_txt(url, timeout=5, retry_count=3):
    parsed_url = urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)

    for attempt in range(retry_count):
        try:
            response = requests.get(robots_url, timeout=timeout)
            if response.status_code < 400:
                rp.parse(response.text.splitlines())
                return rp
        except requests.Timeout:
            print(f"Timeout fetching robots.txt, retrying ({attempt + 1}/{retry_count})...")

    # Default to allowing the crawl if robots.txt is unavailable
    rp.parse(['User-agent: *', 'Disallow:'])
    return rp
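
Once parsed, the resulting RobotFileParser can gate each candidate URL before it is queued. A brief usage sketch, where the "*" user agent and example.com URLs are placeholders:

rp = check_robots_txt("https://example.com/")
if rp.can_fetch("*", "https://example.com/some/page"):
    # Allowed by robots.txt, so the URL can be queued for crawling
    ...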

2.4 BFS Algorithm

The crawler uses Breadth-First Search (BFS) to explore URLs (see the sketch following this list):
  • Processes all URLs at current depth before moving deeper
  • Ensures systematic coverage level by level
  • Allows resumption if interrupted
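
A minimal illustration of the level-by-level order, assuming a synchronous get_links placeholder for the crawler's asynchronous fetch-and-parse step:

from collections import deque

def bfs_order(start_url, get_links, max_depth=3):
    # Every URL at depth d is yielded before any URL at depth d + 1
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

Persisting the (url, depth) frontier in SQLite rather than in memory is one way to support the resumption behavior described above.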

2.5 Error Handling

Error Type        Action
404, 403, 400     Skip (no retry)
429               Wait for Retry-After header
500, 502, 503     Retry with delay
Timeout           Retry up to 3 times
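
A sketch of how these rules might be applied per response, assuming an open aiohttp session; the handler name, back-off values, and retry count are illustrative:

import asyncio
import aiohttp

async def fetch_with_policy(session, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            async with session.get(url) as resp:
                if resp.status in (400, 403, 404):
                    return None                          # skip, no retry
                if resp.status == 429:
                    retry_after = resp.headers.get("Retry-After", "30")
                    await asyncio.sleep(int(retry_after) if retry_after.isdigit() else 30)
                    continue
                if resp.status in (500, 502, 503):
                    await asyncio.sleep(2 ** attempt)    # back off, then retry
                    continue
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            continue                                     # retry up to max_retries
    return None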

3. Results

3.1 Crawling Statistics

Website Type            Internal URLs    External References
Entertainment Site 1    47,870           7,010,999
Gaming Site 2           143,058          3,027,460
Press Release Site      99,243           2,136,887
News Site 6             117,164          1,895,515
Government Site 1       28,841           55,700
Education Site 1        50,458           28,328
Totals:
  • Internal URLs: 1,121,407
  • External References: 18,957,040
[Figure: External to Internal URL Ratio by Website Type]

Entertainment/news sites have much higher external reference ratios than government sites, reflecting different content strategies.

3.2 Error Analysis

Error Type                 Count
Status 404                 43,611
Status 503                 29,555
Decoding errors            3,271
Robots.txt restrictions    2,696
Other errors               124,261
Max retry reached          11,994
Actual skipped links: 62,050

3.3 SEO Insights

  • Media sites: High external reference ratio may increase credibility
  • Government sites: Focus on internal linking is appropriate
  • Balanced approach recommended for most sites

4. Optimizations

Rate Limiting

connector = aiohttp.TCPConnector(limit=100)  # Adjustable
Lower limits prevent 429 errors but slow crawling.
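
A compact sketch of the connector limit in context; the limit value and function name are illustrative:

import asyncio
import aiohttp

async def fetch_batch(urls, limit=20):
    # A lower connection cap trades crawl speed for fewer 429 responses
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def get(url):
            async with session.get(url) as resp:
                return resp.status, await resp.text()
        return await asyncio.gather(*(get(u) for u in urls), return_exceptions=True)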

Dynamic Content (SPAs)

Playwright handles JavaScript-rendered content:
from playwright.async_api import async_playwright

async def scrape_all_links(start_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(start_url)
        # Collect hrefs from anchors present after JavaScript rendering
        return await page.eval_on_selector_all("a[href]", "els => els.map(a => a.href)")

URL Normalization

  • Remove fragments
  • Unify schemes (HTTPS only)
  • Remove trailing slashes
  • Database prevents duplicate crawling
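
A minimal helper along these lines, assuming HTTP URLs are simply upgraded to HTTPS (the function name is illustrative):

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url)
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, lowercase the host, keep the query string
    return urlunsplit((scheme, parts.netloc.lower(), path, parts.query, ""))

print(normalize_url("http://Example.com/docs/#intro"))  # -> https://example.com/docs

With a uniqueness constraint on the normalized URL column, the database then rejects duplicate crawls of equivalent URLs automatically.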

5. Future Directions

  1. Proxy support: Reduce IP blocking risk
  2. Better hardware: Handle larger crawls
  3. Scrapy integration: Leverage robust framework features
  4. Graph analysis: Map website relationships using embeddings

References

  1. aiohttp - Asynchronous HTTP client/server for Python
  2. BeautifulSoup - HTML/XML parsing library
  3. Playwright - Browser automation framework
  4. SQLite - Lightweight database engine