By Ryan Calloway. Updated May 2026.
“Scrape the Hacker News front page” stops being an interview question the day you keep BeautifulSoup, requests, and a CSS-selector cheat sheet in muscle memory. This tutorial gives you a runnable script that pulls every story from the HN front page (rank, title, URL, score, user, age, comment count), paginates across the first three pages, and saves to CSV in about 60 lines of Python 3.13. The same shape — fetch(), parse(), save() — is the one I copy into every new scraping project.
Quick answer: pip install beautifulsoup4 requests lxml, fetch with requests.get(url).text, parse with BeautifulSoup(html, "lxml"), select with CSS like soup.select("tr.athing.submission"), pull text with .get_text(strip=True), attributes with el["href"], and add a 1–2 second sleep between requests plus a descriptive User-Agent. The rest is selectors and politeness.
What you’ll learn
- How to scrape the live Hacker News front page in ~60 lines
- The CSS-selector workflow that beats find and find_all in 2026
- How to paginate, deduplicate, and save to CSV (then SQLite when you outgrow it)
- The polite-scraping rules I keep on a sticky note: User-Agent, robots.txt, rate limit, 429 backoff
- When BeautifulSoup is the wrong tool, and what to reach for instead
Prerequisites
- Python 3.13 (3.12 also fine; the floor is 3.10 for the list[dict] annotations used below)
- A virtual environment (see the Python virtual environment guide if you don’t have one)
- Three packages: beautifulsoup4 4.12+ (the API), lxml 5.x (the fast C parser underneath), and requests 2.32+ (HTTP)
- Browser DevTools (Chrome, Firefox, or Safari) to inspect selectors
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install beautifulsoup4 lxml requests
On Linux, if lxml fails to build, install libxml2-dev and libxslt-dev from your package manager first. On macOS and Windows the wheel ships precompiled. If you hit ModuleNotFoundError after install, see the 5-minute Python fix; it is almost always the wrong interpreter.
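To test the wrong-interpreter theory before reinstalling anything, a small stdlib-only sketch like the one below reports which interpreter is running and whether it can see each package. The helper name `check_env` is made up for this example; note that it checks import names, so beautifulsoup4 appears as `bs4`.

```python
"""Report which interpreter is running and which packages it can import."""
import sys
from importlib.util import find_spec

# Import names, not PyPI names: beautifulsoup4 installs as "bs4".
REQUIRED = ("bs4", "lxml", "requests")


def check_env(modules: tuple[str, ...] = REQUIRED) -> dict[str, bool]:
    """Map each top-level module name to True if this interpreter can find it."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    for name, found in check_env().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

If the printed interpreter path is not inside .venv, the packages were probably installed into a different Python than the one running the script.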
The 60-line Hacker News scraper
This is the working template. Save as hn.py and run it.
"""Scrape Hacker News front pages and save to CSV."""
import csv
import time
from pathlib import Path
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://news.ycombinator.com/news"
HEADERS = {
    "User-Agent": "hn-research-bot/1.0 (contact: you@example.com)",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch(url: str) -> str:
    """Download one page and fail loudly on any 4xx/5xx response."""
    r = requests.get(url, headers=HEADERS, timeout=20)
    r.raise_for_status()
    return r.text


def parse(html: str) -> list[dict]:
    """Turn one front page into a list of story dicts."""
    soup = BeautifulSoup(html, "lxml")
    rows: list[dict] = []
    for athing in soup.select("tr.athing.submission"):
        title_el = athing.select_one("span.titleline > a")
        if title_el is None:
            continue
        # Score, user, age, and the comment link live in the next table row.
        sub_row = athing.find_next_sibling("tr")
        if sub_row is None:
            continue
        score = sub_row.select_one("span.score")
        user = sub_row.select_one("a.hnuser")
        age = sub_row.select_one("span.age > a")
        links = sub_row.select("td.subtext a")
        rank = athing.select_one("span.rank")
        comments_txt = links[-1].get_text(strip=True) if links else ""
        # Require the word "comment" so an age link such as "2 hours ago"
        # is not mistaken for a comment count.
        has_count = comments_txt[:1].isdigit() and "comment" in comments_txt
        rows.append({
            "rank": int(rank.get_text(strip=True).rstrip(".")) if rank else None,
            "title": title_el.get_text(strip=True),
            # Some links are relative (item?id=...); make every URL absolute.
            "url": urljoin(BASE, title_el["href"]),
            "score": int(score.get_text().split()[0]) if score else 0,
            "user": user.get_text(strip=True) if user else None,
            "age": age.get_text(strip=True) if age else None,
            "comments": int(comments_txt.split()[0]) if has_count else 0,
        })
    return rows


def save(rows: list[dict], path: Path) -> None:
    """Write the rows to CSV, using the first row's keys as the header."""
    if not rows:
        return
    with path.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0]))
        w.writeheader()
        w.writerows(rows)


if __name__ == "__main__":
    all_rows: list[dict] = []
    for page in range(1, 4):  # 3 pages = ~90 stories
        all_rows.extend(parse(fetch(f"{BASE}?p={page}")))
        time.sleep(1.5)  # be polite
    save(all_rows, Path("hn_front.csv"))
    print(f"saved {len(all_rows)} stories to hn_front.csv")
One run produces an 85–90 row CSV (HN serves 30 stories per page; the last page is sometimes short). The first site you scrape is rarely going to be HN, but the fetch / parse / save shape generalizes to every public page I have scraped this decade.
How the selectors work
HN’s HTML is old-school table layout: each story is two table rows. The first has class athing submission and contains the title and link inside span.titleline. The second is the next sibling tr and holds td.subtext with score, user, age, and comment link.
The DevTools workflow is the fastest way to find selectors: open the page, right-click the element you want, “Inspect”, then right-click in the element panel and “Copy” → “Copy selector”. Chrome will hand you a brittle path like tr:nth-child(3) > td:nth-child(3) > span.titleline > a. Trim aggressively. Anything with nth-child breaks the moment the page reorders. Keep the shortest selector that still uniquely identifies the kind of element you want — for HN titles, span.titleline > a is enough.
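To see the two-row pairing without touching the network, the sketch below parses a hand-written fragment that imitates HN's markup. The fragment is an assumption reduced to the class names mentioned above, not a copy of the live page, and it uses the stdlib `html.parser` only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hand-written imitation of one HN story: a title row plus its subtext row.
SAMPLE = """
<table>
  <tr class="athing submission" id="1">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/post">Example post</a></span></td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score">42 points</span> by <a class="hnuser" href="user?id=alice">alice</a>
      <span class="age"><a href="item?id=1">2 hours ago</a></span>
      | <a href="item?id=1">7 comments</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
story = soup.select_one("tr.athing.submission")
title = story.select_one("span.titleline > a")
subtext = story.find_next_sibling("tr")  # second row of the pair

print(title.get_text(strip=True), title["href"])
print(subtext.select_one("span.score").get_text(strip=True))
print(subtext.select("td.subtext a")[-1].get_text(strip=True))
```

Pasting a saved copy of the real page into SAMPLE is a cheap way to iterate on selectors without re-fetching.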
CSS selectors beat find and find_all
BeautifulSoup’s .find() and .find_all() were the main API in 2012. In 2026 use .select() and .select_one() with CSS. Same syntax as DevTools, less verbose, and supports descendant, child, and attribute selectors out of the box.
# Old style
container = soup.find("div", {"class": "container"})
articles = container.find_all("article", class_="post")
# CSS selectors
articles = soup.select("div.container article.post")
The official BeautifulSoup CSS selector docs list every supported feature; almost all of CSS Selectors Level 4 works thanks to the soupsieve backend. Pseudo-classes like :has(), :not(), and :nth-of-type() are all available.
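A short sketch of those three pseudo-classes on an invented fragment follows. The class names `post`, `ad`, and `score` are made up for illustration, and `html.parser` is used only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

HTML = """
<ul>
  <li class="post"><a href="/a">First</a> <span class="score">10</span></li>
  <li class="post ad"><a href="/b">Sponsored</a></li>
  <li class="post"><a href="/c">Third</a> <span class="score">3</span></li>
</ul>
"""
soup = BeautifulSoup(HTML, "html.parser")

# :has() keeps only items that contain a score element.
scored = [li.a.get_text() for li in soup.select("li.post:has(span.score)")]

# :not() filters out items carrying the "ad" class.
organic = [li.a["href"] for li in soup.select("li.post:not(.ad)")]

# :nth-of-type() picks by position among siblings of the same tag.
second = soup.select_one("li:nth-of-type(2) > a").get_text()

print(scored)   # ['First', 'Third']
print(organic)  # ['/a', '/c']
print(second)   # Sponsored
```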
Getting text and attributes
el = soup.select_one("a.hnuser")
# Visible text, leading/trailing whitespace stripped
text = el.get_text(strip=True)
# Attribute access
href = el["href"] # raises KeyError if missing
href = el.get("href") # returns None if missing
href = el.get("href", "") # returns default if missing
# Multi-node text with newline separator
para = soup.select_one("article").get_text("\n", strip=True)
Prefer el.get(attr) over el[attr] when the attribute might be missing; otherwise you get a KeyError that looks like a BeautifulSoup bug but is a typo in your selector. Prefer .get_text(strip=True) over .text; .text leaves ragged whitespace from the HTML formatting.
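The whitespace difference is easiest to see side by side. This sketch uses a made-up paragraph; it also shows one thing to watch with `strip=True`, which is that the stripped pieces are joined with an empty string unless you pass a separator.

```python
from bs4 import BeautifulSoup

HTML = "<p>\n  Hello,\n  <b>world</b>\n</p>"
p = BeautifulSoup(HTML, "html.parser").select_one("p")

raw = p.text                          # keeps the source indentation and newlines
tight = p.get_text(strip=True)        # strips each text node, joins with ""
spaced = p.get_text(" ", strip=True)  # same, but joins the pieces with a space

print(repr(raw))     # '\n  Hello,\n  world\n'
print(repr(tight))   # 'Hello,world'
print(repr(spaced))  # 'Hello, world'
```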
Pagination: scraping more than the front page
HN paginates with ?p=N. The script above loops range(1, 4), sleeping 1.5 seconds between fetches. For other sites the pattern is one of three: a query parameter (?page=2, ?offset=20), a “next” link in the DOM, or an infinite-scroll JSON endpoint. Stop conditions matter: an empty result, a 404, or the disappearance of the “next” link. Watch for any of the three.
def crawl(start: str, max_pages: int = 50) -> list[dict]:
    rows: list[dict] = []
    seen: set[str] = set()
    for page in range(1, max_pages + 1):
        new_rows = parse(fetch(f"{start}?p={page}"))
        if not new_rows:  # empty page = stop
            break
        before = len(seen)
        for r in new_rows:
            if r["url"] not in seen:
                seen.add(r["url"])
                rows.append(r)
        if len(seen) == before:  # all duplicates = stop
            break
        time.sleep(1.5)
    return rows
The seen set gives free deduplication and a second stop condition: if a page returns nothing new, the site is looping you back to the same data and you stop.
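For the second pattern, a “next” link in the DOM, a sketch might look like the following. The `a.next` selector and the three-page fake site are assumptions for illustration; on a real site you would pass your own fetch function, substitute whatever selector DevTools shows for the link, and sleep between requests.

```python
from collections.abc import Callable
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def follow_next(start: str, fetch: Callable[[str], str],
                next_selector: str = "a.next", max_pages: int = 50) -> list[str]:
    """Return the URLs visited by following the 'next' link until it disappears."""
    visited: list[str] = []
    url: str | None = start
    # Stop on a missing link, a link back to a page already seen, or the page cap.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        link = soup.select_one(next_selector)
        # Resolve the href against the current page so relative links work.
        url = urljoin(url, link["href"]) if link and link.get("href") else None
    return visited


# Fake three-page site so the sketch runs offline.
PAGES = {
    "https://example.com/list": '<a class="next" href="/list?page=2">Next</a>',
    "https://example.com/list?page=2": '<a class="next" href="/list?page=3">Next</a>',
    "https://example.com/list?page=3": "<p>End of results</p>",
}
print(follow_next("https://example.com/list", PAGES.__getitem__))
```

The `url not in visited` check plays the same role as the seen set: a site that links page 3 back to page 1 ends the loop instead of spinning.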
Saving results: CSV first, SQLite later
Start with CSV. Three lines, opens in Excel, pipes into pandas.read_csv, diffs cleanly in git.
import csv

with open("hn_front.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=list(rows[0]))
    w.writeheader()
    w.writerows(rows)
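One thing the short writer hides is what happens on the way back in: CSV has no types, so every value returns as a string and None returns as an empty string. The sketch below round-trips two invented rows through an in-memory buffer to show the conversion you would need on read.

```python
import csv
import io

rows = [
    {"rank": 1, "title": "Example post", "score": 42, "user": "alice"},
    {"rank": 2, "title": "Job posting", "score": 0, "user": None},
]

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk.
buf = io.StringIO(newline="")
w = csv.DictWriter(buf, fieldnames=list(rows[0]))
w.writeheader()
w.writerows(rows)

buf.seek(0)
loaded = list(csv.DictReader(buf))

# Numbers come back as text and None comes back as "", so convert on read.
scores = [int(r["score"]) for r in loaded]
users = [r["user"] or None for r in loaded]

print(loaded[0])
print(scores, users)
```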
When you hit ~100k rows, need deduplication across runs, or want to query, graduate to SQLite. The sqlite3 module is in the stdlib; the setup is eight lines.
import sqlite3

con = sqlite3.connect("hn.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS stories (
        url TEXT PRIMARY KEY, title TEXT, score INTEGER,
        user TEXT, comments INTEGER, fetched_at TEXT
    )
""")
con.executemany(
    "INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?, ?, datetime('now'))",
    [(r["url"], r["title"], r["score"], r["user"], r["comments"]) for r in rows],
)
con.commit()
The INSERT OR IGNORE on the URL primary key gives you idempotent re-runs for free.
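To check the idempotence claim, this sketch runs the same insert twice against an in-memory database and counts what actually changed. The two rows and the `insert` helper are invented for the demo.

```python
import sqlite3

rows = [
    {"url": "https://example.com/a", "title": "A", "score": 10, "user": "alice", "comments": 3},
    {"url": "https://example.com/b", "title": "B", "score": 5, "user": "bob", "comments": 0},
]

con = sqlite3.connect(":memory:")  # in-memory database for the demo
con.execute("""
    CREATE TABLE IF NOT EXISTS stories (
        url TEXT PRIMARY KEY, title TEXT, score INTEGER,
        user TEXT, comments INTEGER, fetched_at TEXT
    )
""")


def insert(batch: list[dict]) -> int:
    """Insert a batch and return how many rows were actually added."""
    before = con.total_changes
    con.executemany(
        "INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?, ?, datetime('now'))",
        [(r["url"], r["title"], r["score"], r["user"], r["comments"]) for r in batch],
    )
    con.commit()
    return con.total_changes - before


first = insert(rows)   # both rows are new
second = insert(rows)  # same URLs again, so nothing is added
total = con.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
print(first, second, total)
```

The trade-off is that an ignored row keeps its first-seen score and comment count. If you want later runs to refresh those columns, an upsert is the usual alternative.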
requests vs httpx: which one to use
Both fetch HTML; both work fine with BeautifulSoup. I default to requests 2.32+ for tutorials and small scripts because it has the largest install base and the deepest Stack Overflow trail. I switch to httpx 0.28+ for production scrapers that need any of: HTTP/2, async, connection pooling across thousands of requests, or strict timeout-per-stage controls.
| Need | Use |
|---|---|
| Hello-world tutorial / one script | requests |
| Async fan-out (100+ requests in flight) | httpx.AsyncClient |
| HTTP/2 servers | httpx |
| Login session reuse | requests.Session or httpx.Client |
| Mocked unit tests | Either; respx for httpx, responses for requests |
The API surface is intentionally similar enough that swapping is a one-line change in 90% of cases.
When BeautifulSoup is the wrong tool
BeautifulSoup parses the HTML the server sent. Three cases break that assumption.
1. JavaScript-rendered pages. Many 2026 sites ship a near-empty HTML shell and fill content with JavaScript after load. requests.get sees the empty shell; soup.select(".job-card") returns []. Before reaching for Playwright, open DevTools → Network and filter for XHR/Fetch. The data the page renders is almost always coming from a JSON endpoint. Hit that endpoint directly and the scrape becomes one requests.get with no parsing at all.
2. There is an official API. For Hacker News specifically, the official Firebase API returns the full top-stories list as JSON. Use it for production. The HTML scrape in this tutorial is a teaching example because it exercises the selectors; for a real scheduled job, the API is faster, more reliable, and explicitly sanctioned.
3. The page has CAPTCHA or aggressive bot detection. Treat that as the site’s request to use the API instead. Sometimes rotating proxies and slower request rates get through; more often you were about to violate the terms of service.
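For case 2, a minimal sketch against the Hacker News Firebase API could look like this. The endpoint paths and item fields (`title`, `url`, `score`, `by`, `descendants`) are written as I understand the API's published schema, so check them against the official docs before relying on them; the `fetch_json` parameter is an assumption added so the function can be exercised without network access.

```python
import json
from collections.abc import Callable
from typing import Any
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0"


def get_json(url: str) -> Any:
    """Fetch a URL and decode the JSON body (stdlib only)."""
    with urlopen(url, timeout=20) as resp:
        return json.load(resp)


def top_stories(limit: int = 10,
                fetch_json: Callable[[str], Any] = get_json) -> list[dict]:
    """Return the first `limit` top stories as dicts shaped like the CSV rows."""
    ids = fetch_json(f"{API}/topstories.json")[:limit]
    stories = []
    for rank, item_id in enumerate(ids, start=1):
        # One request per story; a missing item decodes to null, hence the "or {}".
        item = fetch_json(f"{API}/item/{item_id}.json") or {}
        stories.append({
            "rank": rank,
            "title": item.get("title"),
            "url": item.get("url"),  # may be absent for text-only posts
            "score": item.get("score", 0),
            "user": item.get("by"),
            "comments": item.get("descendants", 0),
        })
    return stories
```

Calling `top_stories(30)` makes 31 requests, one for the ID list and one per story, so the rate-limit habits from the scraper still apply.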
Polite scraping: the 5 rules
- Read /robots.txt first. The Robots Exclusion Protocol was finally standardized as RFC 9309 in 2022. Use the stdlib urllib.robotparser to check whether your User-Agent is allowed on a path before fetching it.
- Set a descriptive User-Agent with a contact email. Many sites block unknown clients with a 403; a real UA plus a way to reach you also lets webmasters tell you to stop instead of just IP-banning.
- Rate-limit yourself. A one-second delay between requests is my floor for sites I do not own. I drop to 0.3 seconds only after watching a handful of runs go through clean.
- Respect 429 Too Many Requests. Read the Retry-After header if present; otherwise back off 60 seconds and try again. Loop indefinitely on 429s and you will get IP-banned.
- Do not hammer on crashes. If your code throws, fix the bug before re-running. A scraper in a retry loop on a 500 will hit the site harder than the original run.
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://news.ycombinator.com/robots.txt")
rp.read()
assert rp.can_fetch("hn-research-bot/1.0", "https://news.ycombinator.com/news")
Common mistakes
- Parsing with the stdlib html.parser instead of lxml. The BeautifulSoup docs note that parsers can build different trees from malformed HTML and that html.parser is slower than lxml. Install lxml and pass "lxml" as the parser.
- Indexing attributes instead of using .get(). el["data-id"] raises KeyError if the attribute is missing. el.get("data-id") returns None.
- Using .text instead of .get_text(strip=True). .text leaves ragged whitespace from HTML formatting; the diff in your output is a nightmare to clean later.
- Treating relative URLs as absolute. HN serves item links as item?id=12345, not https://.... Use urllib.parse.urljoin(BASE, href) before saving.
- No time.sleep between requests. Even small sites rate-limit. The fix is one line.
- No raise_for_status(). A scraper that silently parses HN’s 503 maintenance page produces zero rows and no error.
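The relative-URL fix is small enough to show in full. This sketch resolves three kinds of href against the same BASE the scraper uses: relative links are resolved against the page, and links that are already absolute pass through unchanged.

```python
from urllib.parse import urljoin

BASE = "https://news.ycombinator.com/news"

hrefs = ("item?id=12345", "https://example.com/post", "/newest")
resolved = [urljoin(BASE, href) for href in hrefs]

for url in resolved:
    print(url)
# https://news.ycombinator.com/item?id=12345
# https://example.com/post
# https://news.ycombinator.com/newest
```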
FAQ
Which is better, BeautifulSoup or Scrapy?
BeautifulSoup plus a for loop for anything under 1,000 pages and a single domain. Scrapy when you have dozens of selectors, item pipelines, distributed politeness controls, and multiple domains. If the tutorial you are reading starts with a Scrapy project skeleton and you wanted one page’s table, stop; BeautifulSoup is fine.
How do I scrape a site that requires login?
POST the login form with requests.Session() or httpx.Client() to preserve the session cookie, then reuse the same client for all subsequent requests. For SPA auth (JWT in localStorage, OAuth redirects), Playwright’s persistent context is usually the simplest path.
Is web scraping legal?
It depends on the site’s terms of service, the jurisdiction, and whether the data is public. In the US, scraping publicly accessible pages for research is typically permitted under hiQ v. LinkedIn, but the question of TOS-bound enforceability is unsettled. I am not a lawyer; when in doubt, use the official API.
How do I handle pagination?
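Find the page query parameter (?p=2, ?page=2, ?offset=20) and loop. Stop on the first empty page, the first 404, or the first time your dedup set stops growing. The crawl() helper above checks for the empty page and the stalled dedup set itself; a 404 ends the run as an exception from raise_for_status() inside fetch().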
Why is soup.select() returning an empty list?
Three usual suspects: the page is JavaScript-rendered (check Network tab for the JSON behind the page), the selector is wrong (run it in DevTools console with document.querySelectorAll(...) first), or the server returned a different page to your User-Agent (print(r.text[:500]) to see what you got).
Should I worry about CAPTCHAs?
Assume they are the site asking you to use the API instead. Sometimes rotating proxies and slower request rates get through; more often you were about to violate the TOS. Check for an official API first.
Sources and further reading
- BeautifulSoup 4 documentation: the canonical reference, kept current
- lxml documentation: the C parser BeautifulSoup uses under the hood
- Requests documentation: the standard HTTP library for Python
- httpx documentation: async-first alternative with HTTP/2 support
- Hacker News official Firebase API: JSON, no scraping required
- RFC 9309, Robots Exclusion Protocol: the standard the robots.txt file follows
- What’s New in Python 3.13: changelog for the version this code targets
If the scraped records become a data pipeline, the dataclasses vs Pydantic v2 guide covers validation for the structured rows you parse.