You want to pull data from websites. Maybe you need product prices, news headlines, or contact information. Doing it by hand is slow. You copy, paste, repeat. That is boring and error-prone. There is a better way. Python gives you tools that do this work automatically. These tools are called libraries. They are like pre-built boxes of functions that save you from writing everything from scratch.
I have been scraping websites for years. When I started, I felt overwhelmed. Every website looked different. Some loaded content with JavaScript. Some blocked my requests. I learned one library at a time. Now I can handle almost any site. Let me walk you through the six libraries that matter most. I will explain them simply. I will show you code that works. You do not need to be a genius. You just need patience and curiosity.
The first library you will meet is requests. It is the simplest way to download a web page. Think of it as a messenger that fetches a file from the internet. You give it a URL. It sends an HTTP request to the server. The server sends back the page content. That is all.
Here is how you use it:
import requests
url = "https://example.com"
response = requests.get(url)
print(response.text)
That little block downloads the homepage of example.com and prints its HTML. HTML is the markup that describes the structure and content of a web page. But to a program it is just text. You need to parse it to find the data you want.
requests handles many tricky parts automatically. If you use a requests.Session, it keeps cookies between calls, so you can stay logged in. It follows redirects if a page has moved. It handles secure connections (HTTPS) without extra work. I once spent hours trying to fix a scraper that failed because the site required a custom header. Then I learned to pass headers like this:
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
That made the site think a real browser was visiting. Suddenly everything worked. So requests is your starting point. Use it whenever you need to get raw data from a web page or an API.
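Many sites also expose a JSON API that is easier to scrape than the HTML itself. Here is a minimal sketch; the endpoint URL is a placeholder for whatever you find in your browser's network tab:
import requests
# Hypothetical JSON endpoint; replace with a real API URL you have found.
url = "https://example.com/api/products"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx errors
data = response.json()       # parse the JSON body into Python objects
print(data)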
Now you have the HTML. How do you find the specific information inside it? That is where BeautifulSoup comes in. It takes messy HTML and turns it into a tidy tree. You can search for tags, classes, or IDs. It is like a map of the page.
I remember my first real project. I had to scrape a list of book titles from an online store. The HTML was full of <div> and <span> elements. My eyes hurt looking at it. Beautiful Soup made it easy:
from bs4 import BeautifulSoup
html = "<html><body><div class='book'>Title: <span>Python 101</span></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
title = soup.find("span").text
print(title) # Python 101
The real web is messier. You might want all links on a page:
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
Beautiful Soup can also use CSS selectors like soup.select(".price"). This is powerful. You can be very specific about what you want. The library is forgiving. Even if the HTML has missing closing tags, it still parses.
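Here is a small sketch of select in action, using made-up HTML with a price class:
from bs4 import BeautifulSoup
# Made-up snippet to demonstrate CSS selectors.
html = "<div class='item'><span class='price'>$9.99</span></div>"
soup = BeautifulSoup(html, "html.parser")
for price in soup.select(".price"):  # every element with class="price"
    print(price.text)  # $9.99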
Beautiful Soup does not download pages itself. You always pair it with requests or another downloader: first fetch, then parse. That combination covers most everyday scraping tasks, and the code stays short and easy to read.
When you need to scrape many pages, you want speed. Scrapy is built for that. It is not just a library; it is a framework. It runs many requests concurrently, follows links automatically, and deals with retries and errors. You create a “spider”, a class that says where to start and what to do with each page.
I used Scrapy to build a price comparison tool. I had to scrape ten thousand product pages every day. Doing it with requests would have taken hours. Scrapy did it in minutes.
Here is a minimal spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
This spider starts on one page. It extracts each quote using CSS selectors. Then it looks for a “next” link and follows it. Scrapy schedules those requests efficiently. It uses asynchronous networking, so it does not sit idle waiting for one response before sending the next.
You run the spider from the command line:
scrapy runspider myspider.py -o quotes.json
That saves all quotes into a JSON file. Scrapy can output CSV, XML, or even send data to a database. It handles authentication, proxies, and custom settings. For large projects, this is the tool you want.
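As a sketch of what custom settings look like, here is how you might slow a spider down and set a polite User-Agent inside the spider class. The setting names are real Scrapy settings; the values are just illustrative:
import scrapy

class PoliteSpider(scrapy.Spider):
    name = "polite"
    start_urls = ["https://quotes.toscrape.com"]

    # Per-spider overrides of Scrapy's global settings.
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # wait one second between requests
        "USER_AGENT": "my-scraper (contact@example.com)",
        "RETRY_TIMES": 2,       # retry failed requests twice
    }

    def parse(self, response):
        yield {"url": response.url}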
Some websites load content after the page is displayed. You see a blank page at first, then JavaScript fills in the data. requests and Beautiful Soup cannot see that content because it is not in the original HTML. You need a real browser. Selenium is a library that automates browsers like Chrome or Firefox.
I once tried to scrape a travel site that showed flight prices only after you clicked a button. My scraper kept getting empty lists. Then I switched to Selenium. It opened a real browser window, waited for the JavaScript to run, and then I could grab the prices.
Here is a simple Selenium script:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
element = driver.find_element(By.CLASS_NAME, "price")
print(element.text)
driver.quit()
Selenium can wait for elements to appear:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
price = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "price")))
print(price.text)
This makes it reliable for dynamic pages. The downside is speed. Selenium opens a full browser window. It uses more memory. For a handful of pages it is fine, but not for thousands. Use it only when you have no other choice.
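One way to reduce the overhead is to run the browser headless, without a visible window. A minimal sketch with Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()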
If you need pure speed for parsing, look at lxml. It is a Python binding to the fast C libraries libxml2 and libxslt. It parses HTML and XML very quickly. You can use XPath, a powerful query language for selecting nodes.
I used lxml when I had to parse a huge XML file with millions of records. Beautiful Soup took too long. lxml did it in seconds.
Here is how you use lxml with HTML:
from lxml import html
import requests
page = requests.get("https://example.com")
tree = html.fromstring(page.content)
titles = tree.xpath("//h2/text()")
print(titles)
XPath looks intimidating but is simple: //h2 means “any h2 element anywhere”, /text() means “get the text”. You can get very precise, like //div[@class='product']/span[@class='price']/text(). This is often shorter than Beautiful Soup’s method calls.
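Here is that precise selector in a runnable sketch, using made-up HTML:
from lxml import html

# Made-up HTML fragment to demonstrate a precise XPath query.
snippet = """
<div class='product'><span class='price'>19.99</span></div>
<div class='ad'><span class='price'>0.00</span></div>
"""
tree = html.fromstring(snippet)
# Only prices inside div elements whose class is exactly 'product'.
prices = tree.xpath("//div[@class='product']/span[@class='price']/text()")
print(prices)  # ['19.99']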
lxml also integrates with Beautiful Soup. You can tell Beautiful Soup to use lxml as its parser:
soup = BeautifulSoup(html_content, "lxml")
That gives you Beautiful Soup’s convenience with lxml’s speed. For everyday scraping, that combination is my default.
Modern web applications are complex. They use single-page applications (SPAs) where everything is rendered by JavaScript. Selenium works, but it is heavy. Playwright is a newer tool that does the same thing, but faster and with a nicer API.
I switched to Playwright for scraping a dashboard that required mouse clicks and drags. Playwright can simulate those interactions exactly.
Here is how you start:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.inner_text(".result")
    print(content)
    browser.close()
Playwright auto-waits on most actions, and you can also wait for a selector explicitly:
page.wait_for_selector(".price", timeout=5000)
price = page.inner_text(".price")
It supports multiple browsers: Chromium, Firefox, WebKit. You can intercept network requests, take screenshots, or generate PDFs. Playwright is the evolution of browser automation. It is the tool I reach for when a site is heavily dynamic.
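For example, taking a screenshot is one line, and you can block image downloads to speed crawls up. A sketch:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Abort image requests so pages load faster.
    page.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    page.goto("https://example.com")
    page.screenshot(path="page.png")  # save a screenshot to disk
    browser.close()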
You might wonder which library to choose for your project. Let me give you a simple rule.
Start with requests and BeautifulSoup. This pair handles most static websites. If the site uses JavaScript to load important content, check the raw HTML first: sometimes the data is already embedded there as JSON inside a script tag, and you can extract it without a browser. If not, move to Playwright or Selenium. Use Playwright if you want a modern, fast tool. Use Selenium if you already have it installed and need quick results.
For big projects, use Scrapy. It manages everything. You can even integrate Selenium or Playwright inside Scrapy for the tricky parts.
Remember to be respectful. Websites have rules. Check the robots.txt file. Do not overload the server with too many requests. Add delays. Use a User-Agent that identifies your scraper politely. Some sites block scrapers aggressively. You may need to rotate IP addresses using proxies, but that is an advanced topic.
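Python's standard library can check robots.txt for you, and time.sleep adds a delay. A minimal sketch of both habits:
import time
import urllib.robotparser

# Check robots.txt before crawling.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/page"
if rp.can_fetch("my-scraper", url):
    # ... fetch the page here ...
    time.sleep(1)  # pause between requests to be polite
else:
    print("robots.txt disallows this URL")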
I want to show you a complete example that mixes two libraries. Imagine you want to scrape news articles from a site that loads comments via JavaScript after the page loads. You can use requests to get the article text, then use Playwright to wait for comments.
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Step 1: Get the article
article_url = "https://example.com/article"
response = requests.get(article_url)
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("h1").text
body = soup.find("div", class_="content").text
print("Title:", title)

# Step 2: Get comments using Playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(article_url)
    # Wait for comment section to load
    page.wait_for_selector(".comment-list")
    comments = page.inner_text(".comment-list")
    print("Comments:", comments)
    browser.close()
This combines the speed of requests for static content and the power of Playwright for dynamic parts. You get the best of both worlds.
I have scraped hundreds of websites over the years. Every project taught me something new. Once I tried to scrape a site that required you to scroll down to load more items. Playwright handled it perfectly with page.evaluate("window.scrollTo(0, document.body.scrollHeight)"). Another time, a site returned data in JSON format inside a script tag. I used BeautifulSoup to find the script, then json.loads() to parse it.
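That second pattern comes up a lot, so here is a sketch of it. The script id is made up and will differ per site:
import json
from bs4 import BeautifulSoup

# Made-up HTML: many sites embed their data as JSON in a script tag.
html = '<script id="data">{"items": [{"name": "Python 101"}]}</script>'
soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", id="data")  # the id varies per site
data = json.loads(script.string)         # parse the embedded JSON
print(data["items"][0]["name"])          # Python 101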
There is always a way. The libraries I described are your Swiss Army knife. They cover downloading, parsing, and automation. Learn one at a time. Build small projects. Scrape your own blog, or a list of books from a public domain site.
You do not need to be a programmer to start. Copy the code examples. Run them. See what happens. Change the URLs. Modify the selectors. Break things and fix them. That is how you learn.
I still keep a cheat sheet of the most common patterns. Here is one that I use daily:
import requests
from bs4 import BeautifulSoup

def scrape_table(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    table = soup.find("table")
    rows = table.find_all("tr")
    data = []
    for row in rows:
        cols = row.find_all(["td", "th"])  # include header cells too
        if cols:  # skip rows with no cells
            data.append([col.text.strip() for col in cols])
    return data
That function grabs the first table on a page and returns its rows as lists. It works on many sites with minimal changes. You can adapt it to your needs.
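Usage is one call; the URL here is a placeholder:
# Hypothetical page that contains an HTML table.
rows = scrape_table("https://example.com/stats")
for row in rows:
    print(row)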
Now you have the six libraries. requests for fetching pages. BeautifulSoup for parsing. Scrapy for scale. Selenium and Playwright for browser automation. lxml for speed. Each has a place.
Do not be afraid to mix them. Start simple. Use requests and BeautifulSoup. If you hit a wall, add one more tool. The internet is full of structured data waiting to be collected. You have the power to collect it.
Go ahead. Try it. Pick a website you like. Write a few lines of code. Extract a headline. Smile when it works. That is the feeling of control over the digital world. I remember that feeling, and you will too.