§ POST · MAY 11, 2026 v1.0

Python web scraping with BeautifulSoup: 60-line HN scraper

A real Python BeautifulSoup tutorial: scrape the Hacker News front page in 60 lines, handle pagination, save to CSV, and follow the polite-scraping rules I use.
Ryan Calloway, Staff contributor

By Ryan Calloway. Updated May 2026.

“Scrape the Hacker News front page” stops being an interview question the day you keep BeautifulSoup, requests, and a CSS-selector cheat sheet in muscle memory. This tutorial gives you a runnable script that pulls every story from the HN front page (rank, title, URL, score, user, age, comment count), paginates across the first three pages, and saves to CSV in about 60 lines of Python 3.13. The same shape — fetch(), parse(), save() — is the one I copy into every new scraping project.

Quick answer: pip install beautifulsoup4 requests lxml, fetch with requests.get(url).text, parse with BeautifulSoup(html, "lxml"), select with CSS like soup.select("tr.athing.submission"), pull text with .get_text(strip=True), attributes with el["href"], and add a 1–2 second sleep between requests plus a descriptive User-Agent. The rest is selectors and politeness.

What you’ll learn

Prerequisites

python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install beautifulsoup4 lxml requests

lxml ships precompiled wheels for the common platforms, so pip install normally just works. If pip falls back to building from source on Linux, install the libxml2 and libxslt development headers from your package manager first (libxml2-dev and libxslt1-dev on Debian/Ubuntu). If you hit ModuleNotFoundError after install, see the 5-minute Python fix; it is almost always the wrong interpreter.

The 60-line Hacker News scraper

This is the working template. Save as hn.py and run it.

"""Scrape Hacker News front pages and save to CSV."""
import csv
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

BASE = "https://news.ycombinator.com/news"
HEADERS = {
    "User-Agent": "hn-research-bot/1.0 (contact: you@example.com)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch(url: str) -> str:
    r = requests.get(url, headers=HEADERS, timeout=20)
    r.raise_for_status()
    return r.text

def parse(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    rows: list[dict] = []
    for athing in soup.select("tr.athing.submission"):
        title_el = athing.select_one("span.titleline > a")
        if title_el is None:
            continue
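        sub_row = athing.find_next_sibling("tr")
        if sub_row is None:                   # no metadata row follows; skip
            continue
        score   = sub_row.select_one("span.score")
        user    = sub_row.select_one("a.hnuser")
        age     = sub_row.select_one("span.age > a")
        links   = sub_row.select("td.subtext a")
        rank    = athing.select_one("span.rank")
        # The last subtext link reads "N comments", or "discuss" when there are none
        comments_txt = links[-1].get_text(strip=True) if links else ""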
        rows.append({
            "rank": int(rank.get_text(strip=True).rstrip(".")) if rank else None,
            "title": title_el.get_text(strip=True),
            "url": title_el["href"],
            "score": int(score.get_text().split()[0]) if score else 0,
            "user": user.get_text(strip=True) if user else None,
            "age": age.get_text(strip=True) if age else None,
            "comments": int(comments_txt.split()[0]) if comments_txt[:1].isdigit() else 0,
        })
    return rows

def save(rows: list[dict], path: Path) -> None:
    if not rows:
        return
    with path.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0]))
        w.writeheader()
        w.writerows(rows)

if __name__ == "__main__":
    all_rows: list[dict] = []
    for page in range(1, 4):                  # 3 pages = ~90 stories
        all_rows.extend(parse(fetch(f"{BASE}?p={page}")))
        time.sleep(1.5)                       # be polite
    save(all_rows, Path("hn_front.csv"))
    print(f"saved {len(all_rows)} stories to hn_front.csv")

One run produces an 85–90 row CSV (HN serves 30 stories per page; the last page is sometimes short). The first site you scrape is rarely going to be HN, but the fetch / parse / save shape generalizes to every public page I have scraped this decade.

How the selectors work

HN’s HTML is old-school table layout: each story is two table rows. The first has class athing submission and contains the title and link inside span.titleline. The second is the next sibling tr and holds td.subtext with score, user, age, and comment link.
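
To see the two-row pairing in isolation, here is a minimal sketch run against a hand-written fragment rather than the live site. The markup is a simplified assumption modelled on the structure described above, not a copy of HN's real HTML, and it uses the stdlib "html.parser" backend so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one HN story (two sibling rows).
SAMPLE = """
<table>
  <tr class="athing submission">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">Example story</a>
        <span class="sitebit">(<a href="from?site=example.com">example.com</a>)</span></span></td>
  </tr>
  <tr>
    <td class="subtext"><span class="score">42 points</span> by
        <a class="hnuser" href="user?id=alice">alice</a> |
        <a href="item?id=1">7 comments</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

story = soup.select_one("tr.athing.submission")
# "> a" matches only the direct child link, so the nested site link is skipped.
title = story.select_one("span.titleline > a")
# The metadata lives in the next <tr>, not inside the story row itself.
meta = story.find_next_sibling("tr")

score = meta.select_one("span.score").get_text(strip=True)
user = meta.select_one("a.hnuser").get_text(strip=True)

print(title.get_text(strip=True), title["href"])
print(score, user)
```

The point of the sketch is the find_next_sibling("tr") hop: selecting span.score inside the story row itself would return None, because the score is in the row after it.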

The DevTools workflow is the fastest way to find selectors: open the page, right-click the element you want, “Inspect”, then right-click in the element panel and “Copy” → “Copy selector”. Chrome will hand you a brittle path like tr:nth-child(3) > td:nth-child(3) > span.titleline > a. Trim aggressively. Anything with nth-child breaks the moment the page reorders. Keep the shortest selector that still uniquely identifies the kind of element you want — for HN titles, span.titleline > a is enough.

CSS selectors beat find and find_all

BeautifulSoup’s .find() and .find_all() were the main API in 2012. In 2026, prefer .select() and .select_one() with CSS selectors: the syntax matches what you test in DevTools, it is less verbose, and it supports descendant, child, and attribute selectors out of the box.

# Old style
container = soup.find("div", {"class": "container"})
articles  = container.find_all("article", class_="post")

# CSS selectors
articles  = soup.select("div.container article.post")

The official BeautifulSoup CSS selector docs list every supported feature; almost all of CSS Selectors Level 4 works thanks to the soupsieve backend. Pseudo-classes like :has(), :not(), and :nth-of-type() are all available.
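
A short sketch of those three pseudo-classes, again on a made-up fragment (the class names post, ad, and ext are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

HTML = """
<ul>
  <li class="post"><a class="ext" href="https://example.com/1">One</a></li>
  <li class="post ad"><a class="ext" href="https://example.com/ad">Sponsored</a></li>
  <li class="post"><span>No link here</span></li>
</ul>
"""
soup = BeautifulSoup(HTML, "html.parser")

# :has() keeps items that contain a direct a.ext child
with_link = [li.get_text(strip=True) for li in soup.select("li.post:has(> a.ext)")]
# :not() drops anything carrying the "ad" class
organic = [li.get_text(strip=True) for li in soup.select("li.post:not(.ad)")]
# :nth-of-type() picks by position among sibling <li> elements
second = soup.select_one("li:nth-of-type(2)").get_text(strip=True)

print(with_link)
print(organic)
print(second)
```

Positional selectors such as :nth-of-type() carry the same fragility as the nth-child paths DevTools generates, so they are best kept for pages whose order you control.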

Getting text and attributes

el = soup.select_one("a.hnuser")

# Visible text, leading/trailing whitespace stripped
text = el.get_text(strip=True)

# Attribute access
href = el["href"]                # raises KeyError if missing
href = el.get("href")            # returns None if missing
href = el.get("href", "")        # returns default if missing

# Multi-node text with newline separator
para = soup.select_one("article").get_text("\n", strip=True)

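Prefer el.get(attr) over el[attr] when the attribute might be missing; el[attr] raises a KeyError that looks like a BeautifulSoup bug but usually means the element you matched is not the one you expected. If select_one() matched nothing at all, el is None and the failure is a TypeError or AttributeError instead, which usually points to a typo in your selector. Prefer .get_text(strip=True) over .text; .text leaves ragged whitespace from the HTML formatting.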

Pagination: scraping more than the front page

HN paginates with ?p=N. The script above loops range(1, 4), sleeping 1.5 seconds between fetches. For other sites the pattern is one of three: a query parameter (?page=2, ?offset=20), a “next” link in the DOM, or an infinite-scroll JSON endpoint. Stop conditions matter: an empty result, a 404, or the disappearance of the “next” link. Watch for any of the three.

def crawl(start: str, max_pages: int = 50) -> list[dict]:
    rows: list[dict] = []
    seen: set[str] = set()
    for page in range(1, max_pages + 1):
        new_rows = parse(fetch(f"{start}?p={page}"))
        if not new_rows:                      # empty page = stop
            break
        before = len(seen)
        for r in new_rows:
            if r["url"] not in seen:
                seen.add(r["url"])
                rows.append(r)
        if len(seen) == before:               # all duplicates = stop
            break
        time.sleep(1.5)
    return rows

The seen set gives free deduplication and a second stop condition: if a page returns nothing new, the site is looping you back to the same data and you stop.
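
Note that crawl() as written does not catch a 404: fetch() calls raise_for_status(), so the error propagates and the rows collected so far are lost. Below is a minimal sketch of a variant that treats a missing page as a stop condition. To keep it runnable offline, the page source is an injected callable and PageMissing is a hypothetical stand-in exception; in a real script you would catch requests.HTTPError and check the status code instead.

```python
import time
from typing import Callable


class PageMissing(Exception):
    """Hypothetical stand-in for an HTTP 404 raised by the fetch step."""


def crawl_until_done(
    get_rows: Callable[[int], list[dict]],
    max_pages: int = 50,
    delay: float = 1.5,
) -> list[dict]:
    rows: list[dict] = []
    seen: set[str] = set()
    for page in range(1, max_pages + 1):
        try:
            new_rows = get_rows(page)
        except PageMissing:          # stop 1: the page does not exist
            break
        if not new_rows:             # stop 2: empty page
            break
        added = 0
        for r in new_rows:
            if r["url"] not in seen:
                seen.add(r["url"])
                rows.append(r)
                added += 1
        if added == 0:               # stop 3: nothing new, the site is looping
            break
        time.sleep(delay)            # politeness delay between pages
    return rows


# Fake three-page site: page 2 overlaps page 1, page 3 does not exist.
PAGES = {
    1: [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
    2: [{"url": "https://example.com/b"}, {"url": "https://example.com/c"}],
}


def fake_get_rows(page: int) -> list[dict]:
    if page not in PAGES:
        raise PageMissing(page)
    return PAGES[page]


result = crawl_until_done(fake_get_rows, delay=0)
print([r["url"] for r in result])
```

The demo passes delay=0 only because the data is local; against a real site, keep the 1.5-second default.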

Saving results: CSV first, SQLite later

Start with CSV. It takes a handful of lines, opens in Excel, pipes into pandas.read_csv, and diffs cleanly in git.

import csv
with open("hn_front.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=list(rows[0]))
    w.writeheader()
    w.writerows(rows)

When you hit ~100k rows, need deduplication across runs, or want to query, graduate to SQLite. The sqlite3 module is in the stdlib; the setup is about a dozen lines.

import sqlite3
con = sqlite3.connect("hn.db")
con.execute("""
  CREATE TABLE IF NOT EXISTS stories (
    url TEXT PRIMARY KEY, title TEXT, score INTEGER,
    user TEXT, comments INTEGER, fetched_at TEXT
  )
""")
con.executemany(
    "INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?, ?, datetime('now'))",
    [(r["url"], r["title"], r["score"], r["user"], r["comments"]) for r in rows],
)
con.commit()

The INSERT OR IGNORE on the URL primary key gives you idempotent re-runs for free.
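
A quick way to convince yourself of the idempotence claim is to run the same insert twice against an in-memory database. This is a sketch with two made-up rows; the schema and statement are the ones shown above, and store() is a helper name introduced here for illustration.

```python
import sqlite3

# Two made-up rows in the shape parse() produces (only the stored columns).
rows = [
    {"url": "https://example.com/a", "title": "A", "score": 10, "user": "alice", "comments": 3},
    {"url": "https://example.com/b", "title": "B", "score": 5, "user": "bob", "comments": 0},
]

con = sqlite3.connect(":memory:")   # in-memory database, nothing written to disk
con.execute("""
  CREATE TABLE IF NOT EXISTS stories (
    url TEXT PRIMARY KEY, title TEXT, score INTEGER,
    user TEXT, comments INTEGER, fetched_at TEXT
  )
""")


def store(batch: list[dict]) -> None:
    con.executemany(
        "INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?, ?, datetime('now'))",
        [(r["url"], r["title"], r["score"], r["user"], r["comments"]) for r in batch],
    )
    con.commit()


store(rows)
store(rows)                         # second run: both URLs already exist, so nothing is added
count = con.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
print(count)
```

One design consequence: OR IGNORE keeps the first-seen score and comment count. If you want later runs to refresh those values, an upsert (INSERT ... ON CONFLICT(url) DO UPDATE) is the usual alternative.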

requests vs httpx: which one to use

Both fetch HTML; both work fine with BeautifulSoup. I default to requests 2.32+ for tutorials and small scripts because it has the largest install base and the deepest Stack Overflow trail. I switch to httpx 0.28+ for production scrapers that need any of: HTTP/2, async, connection pooling across thousands of requests, or strict timeout-per-stage controls.

Need                                      Use
Hello-world tutorial / one script         requests
Async fan-out (100+ requests in flight)   httpx.AsyncClient
HTTP/2 servers                            httpx
Login session reuse                       requests.Session or httpx.Client
Mocked unit tests                         Either; respx for httpx, responses for requests

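The API surface is intentionally similar, so swapping is often a one-line change. The difference that bites most often: httpx does not follow redirects by default, so pass follow_redirects=True where requests would have followed them silently.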

When BeautifulSoup is the wrong tool

BeautifulSoup parses the HTML the server sent. Three cases break that assumption.

1. JavaScript-rendered pages. Many 2026 sites ship a near-empty HTML shell and fill content with JavaScript after load. requests.get sees the empty shell; soup.select(".job-card") returns []. Before reaching for Playwright, open DevTools → Network and filter for XHR/Fetch. The data the page renders very often comes from a JSON endpoint. Hit that endpoint directly and the scrape becomes one requests.get with no HTML parsing at all.

2. There is an official API. For Hacker News specifically, the official Firebase API returns the top-stories list as a JSON array of item IDs, and each story is one more JSON request by ID. Use it for production. The HTML scrape in this tutorial is a teaching example because it exercises the selectors; for a real scheduled job, the API is faster, more reliable, and explicitly sanctioned.
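
For comparison, a minimal sketch of the API route using only the standard library. The endpoint paths and field names (title, url, score, by, descendants) follow the public HN API; the function names, the User-Agent string, and the injectable fetch parameter are assumptions made here so the logic can be exercised without a network call.

```python
import json
from typing import Any, Callable
from urllib.request import Request, urlopen

API = "https://hacker-news.firebaseio.com/v0"
USER_AGENT = "hn-research-bot/1.0 (contact: you@example.com)"


def get_json(url: str) -> Any:
    """Fetch one URL and decode its JSON body."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=20) as resp:
        return json.load(resp)


def top_stories(limit: int = 30, fetch: Callable[[str], Any] = get_json) -> list[dict]:
    """Return the first `limit` top stories as flat dicts."""
    ids = fetch(f"{API}/topstories.json")[:limit]
    stories: list[dict] = []
    for story_id in ids:
        item = fetch(f"{API}/item/{story_id}.json")
        if not item:                           # deleted items come back as null
            continue
        stories.append({
            "title": item.get("title"),
            "url": item.get("url"),            # absent on text posts such as Ask HN
            "score": item.get("score", 0),
            "user": item.get("by"),
            "comments": item.get("descendants", 0),
        })
    return stories
```

Calling top_stories(limit=30) makes 31 requests (one for the ID list, one per story), so the same pacing rules apply as for the HTML scrape.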

3. The page has CAPTCHA or aggressive bot detection. Treat that as the site’s request to use the API instead. Sometimes rotating proxies and slower request rates get through; more often you were about to violate the terms of service.

Polite scraping: the 5 rules

  1. Read /robots.txt first. The Robots Exclusion Protocol was finally standardized as RFC 9309 in 2022. Use the stdlib urllib.robotparser to check whether your User-Agent is allowed on a path before fetching it.
  2. Set a descriptive User-Agent with a contact email. Many sites block unknown clients with a 403; a real UA plus a way to reach you also lets webmasters tell you to stop instead of just IP-banning.
  3. Rate-limit yourself. One second between requests is the shortest delay I use on sites I do not own. I drop to 0.3 seconds only after watching a handful of runs go through clean.
  4. Respect 429 Too Many Requests. Read the Retry-After header if present; otherwise back off 60 seconds and try again. Loop indefinitely on 429s and you will get IP-banned.
  5. Do not hammer on crashes. If your code throws, fix the bug before re-running. A scraper in a retry loop on a 500 will hit the site harder than the original run.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://news.ycombinator.com/robots.txt")
rp.read()
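# Use an explicit check rather than assert, which is stripped under python -O
if not rp.can_fetch("hn-research-bot/1.0", "https://news.ycombinator.com/news"):
    raise SystemExit("robots.txt disallows this path for our User-Agent")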

Common mistakes

FAQ

Which is better, BeautifulSoup or Scrapy?

BeautifulSoup plus a for loop for anything under 1,000 pages and a single domain. Scrapy when you have dozens of selectors, item pipelines, distributed politeness controls, and multiple domains. If the tutorial you are reading starts with a Scrapy project skeleton and you wanted one page’s table, stop; BeautifulSoup is fine.

How do I scrape a site that requires login?

POST the login form with requests.Session() or httpx.Client() to preserve the session cookie, then reuse the same client for all subsequent requests. For SPA auth (JWT in localStorage, OAuth redirects), Playwright’s persistent context is usually the simplest path.

Is web scraping legal?

It depends on the site’s terms of service, the jurisdiction, and whether the data is public. In the US, the Ninth Circuit held in hiQ v. LinkedIn that scraping publicly accessible pages likely does not violate the Computer Fraud and Abuse Act, but hiQ was later found to have breached LinkedIn’s user agreement, so how far terms of service can be enforced against scrapers remains unsettled. I am not a lawyer; when in doubt, use the official API.

How do I handle pagination?

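Find the page query parameter (?p=2, ?page=2, ?offset=20) and loop. Stop on the first empty page, the first 404, or the first time your dedup set stops growing. The crawl() helper above handles the empty-page and dedup cases; a 404 surfaces as a requests.HTTPError from fetch(), so catch it if you want to keep the rows collected so far.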

Why is soup.select() returning an empty list?

Three usual suspects: the page is JavaScript-rendered (check Network tab for the JSON behind the page), the selector is wrong (run it in DevTools console with document.querySelectorAll(...) first), or the server returned a different page to your User-Agent (print(r.text[:500]) to see what you got).

Should I worry about CAPTCHAs?

Assume they are the site asking you to use the API instead. Sometimes rotating proxies and slower request rates get through; more often you were about to violate the TOS. Check for an official API first.

Sources and further reading

If the scraped records become a data pipeline, the dataclasses vs Pydantic v2 guide covers validation for the structured rows you parse.
