Fast Reddit Scraping

Today I’m going to walk you through the process of scraping search results from Reddit using Python. We’re going to write a simple program that performs a keyword search and extracts useful information from the search results. Then we’re going to improve our program’s performance by taking advantage of parallel processing.

Tools

We’ll be using the following Python 3 libraries to make our job easier:

  • Requests, to download the search result pages
  • Beautiful Soup 4 (together with the lxml parser), to extract data from the HTML
  • multiprocessing, to parallelize the work

multiprocessing is part of the Python 3 standard library, but you may need to install the others manually using a package manager such as pip:

pip3 install beautifulsoup4
pip3 install requests
pip3 install lxml
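
After installing, a quick way to confirm that everything is importable (the version numbers will of course differ on your machine):

import bs4, requests, lxml, multiprocessing
print(bs4.__version__, requests.__version__)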

Old Reddit

Before we begin, I want to point out that we’ll be scraping the old Reddit, not the new one. That’s because the new site loads more posts automatically when you scroll down:


The problem is that it’s not possible to simulate this scroll-down action with a simple tool like Requests; we’d need something like Selenium, which can drive a real browser, for that. As a workaround, we’re going to use the old site, which is easier to crawl thanks to the plain pagination links in its navigation panel:

Scraper v1 - Program Arguments

Let’s start by making our program accept some arguments that will allow us to customize our search. Here are some useful parameters:

  • keyword to search
  • subreddit restriction (optional)
  • date restriction (optional)

Let’s say we want to search for the keyword “web scraping”. In this case, the URL we want to request is:
https://old.reddit.com/search?q=%22web+scraping%22

If we want to limit our search with a particular subreddit such as “r/Python”, then our URL will become:
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on

Finally, the URL is going to look like one of the following if we want to search for the posts submitted in the last year:
https://old.reddit.com/search?q=%22web+scraping%22&t=year
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year
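
If you’d rather not assemble these query strings by hand, the same URLs can be produced with urllib.parse from the standard library. Here’s a minimal sketch (the scraper below simply sticks to string concatenation, which works fine for our purposes):

# Building the year-restricted r/Python search URL with urlencode
from urllib.parse import urlencode

params = {'q': '"web scraping"', 'restrict_sr': 'on', 't': 'year'}
print('https://old.reddit.com/r/Python/search?' + urlencode(params))
# prints https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year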

The following is the initial version of our program that builds and prints the appropriate URL according to the program arguments:

scraper.py (v1)
import argparse

SITE_URL = 'https://old.reddit.com/'

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit == None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date == 'day' or args.date == 'week' or args.date == 'month' or args.date == 'year':
        searchUrl += '&t=' + args.date
    print('Search URL:', searchUrl)

Now we can run our program as follows:

python3 scraper.py --keyword="dave weckl" --subreddit="drums" --date="month"
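
This should print a search URL along the lines of the following (the quotes and the space are left as-is by our program; Requests will percent-encode them when the URL is actually fetched in the later versions):

Search URL: https://old.reddit.com/r/drums/search?q="dave weckl"&restrict_sr=on&t=month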

Scraper v2 - Collecting Search Results

If you take a look at the page source, you’ll notice that all the post results are stored in <div>s with a search-result-link class. Also note that unless we’re on the last page, there will be an <a> tag whose rel attribute is equal to nofollow next. That’s how we’ll know when to stop advancing to the next page.
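
Before writing the full crawler, we can sanity-check these selectors in an interactive session. Here’s a quick sketch (the browser-like User-Agent mirrors the one the scraper below sends):

from bs4 import BeautifulSoup
import requests

page = requests.get('https://old.reddit.com/search?q=%22web+scraping%22',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'lxml')
print(len(soup.findAll('div', {'class': 'search-result-link'})))  # posts on this page
print(soup.find('a', {'rel': 'nofollow next'}))                   # next-page link, or None on the last page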

Therefore, using the URL we built from the program arguments, we can collect the post <div>s from all result pages with a simple function that we’ll call getSearchResults. Here’s the second version of our program:

scraper.py (v2)
from bs4 import BeautifulSoup
import argparse
import requests

SITE_URL = 'https://old.reddit.com/'
REQUEST_AGENT = 'Mozilla/5.0 Chrome/47.0.2526.106 Safari/537.36'

def createSoup(url):
    return BeautifulSoup(requests.get(url, headers={'User-Agent':REQUEST_AGENT}).text, 'lxml')

def getSearchResults(searchUrl):
    posts = []
    while True:
        resultPage = createSoup(searchUrl)
        posts += resultPage.findAll('div', {'class':'search-result-link'})
        footer = resultPage.findAll('a', {'rel':'nofollow next'})
        if footer:
            searchUrl = footer[-1]['href']
        else:
            return posts

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit == None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date == 'day' or args.date == 'week' or args.date == 'month' or args.date == 'year':
        searchUrl += '&t=' + args.date
    posts = getSearchResults(searchUrl)
    print('Search URL:', searchUrl, '\nFound', len(posts), 'posts.')

Scraper v3 - Parsing Post Data

Now that we have a bunch of posts in the form of a list of bs4.element.Tag objects, we can extract useful information by parsing each element of this list further. We can extract information such as:

  • date: the datetime attribute of the <time> tag
  • title: the <a> tag with the search-title class
  • score: the <span> tag with the search-score class
  • author: the <a> tag with the author class
  • subreddit: the <a> tag with the search-subreddit-link class
  • URL: the href attribute of the <a> tag with the search-comments class
  • # of comments: the text field of the <a> tag with the search-comments class

We’re also going to create a container object to store the extracted data and save it as a JSON file (product.json). We’ll load this file at the beginning of our program, since it may already contain data from previous keyword searches. When we’re done scraping the current keyword, we’ll append the new content to the existing data. Here’s the third version of our program:

scraper.py (v3)
from datetime import datetime
from bs4 import BeautifulSoup
import argparse
import requests
import json
import re

SITE_URL = 'https://old.reddit.com/'
REQUEST_AGENT = 'Mozilla/5.0 Chrome/47.0.2526.106 Safari/537.36'

def createSoup(url):
    return BeautifulSoup(requests.get(url, headers={'User-Agent':REQUEST_AGENT}).text, 'lxml')

def getSearchResults(searchUrl):
    posts = []
    while True:
        resultPage = createSoup(searchUrl)
        posts += resultPage.findAll('div', {'class':'search-result-link'})
        footer = resultPage.findAll('a', {'rel':'nofollow next'})
        if footer:
            searchUrl = footer[-1]['href']
        else:
            return posts

def parsePosts(posts, product, keyword):
    for post in posts:
        time = post.find('time')['datetime']
        date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')
        title = post.find('a', {'class':'search-title'}).text
        score = post.find('span', {'class':'search-score'}).text
        score = int(re.match(r'[+-]?\d+', score).group(0))
        author = post.find('a', {'class':'author'}).text
        subreddit = post.find('a', {'class':'search-subreddit-link'}).text
        commentsTag = post.find('a', {'class':'search-comments'})
        url = commentsTag['href']
        numComments = int(re.match(r'\d+', commentsTag.text).group(0))
        product[keyword].append({'title':title, 'url':url, 'date':str(date),
                                 'score':score, 'author':author, 'subreddit':subreddit})
    return product

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit == None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date == 'day' or args.date == 'week' or args.date == 'month' or args.date == 'year':
        searchUrl += '&t=' + args.date
    try:
        product = json.load(open('product.json'))
    except FileNotFoundError:
        print('WARNING: Database file not found. Creating a new one...')
        product = {}
    print('Search URL:', searchUrl)
    posts = getSearchResults(searchUrl)
    print('Started scraping', len(posts), 'posts.')
    keyword = args.keyword.replace(' ', '-')
    product[keyword] = []
    product = parsePosts(posts, product, keyword)
    with open('product.json', 'w', encoding='utf-8') as f:
        json.dump(product, f, indent=4, ensure_ascii=False)
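
For reference, product.json ends up with one key per scraped keyword (spaces replaced by dashes), each holding the list of post entries built in parsePosts. An illustrative excerpt, with made-up values:

{
    "web-scraping": [
        {
            "title": "Some post title",
            "url": "https://old.reddit.com/r/Python/comments/.../",
            "date": "2019-05-14 09:30:00",
            "score": 42,
            "author": "some_user",
            "subreddit": "r/Python"
        }
    ],
    "dave-weckl": [ ... ]
}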

Now we can search for different keywords by running our program multiple times. The extracted data will be appended to the product.json file after each execution.
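
For example:

python3 scraper.py --keyword="web scraping" --date="year"
python3 scraper.py --keyword="dave weckl" --subreddit="drums"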

Scraper v4 - Scraping Comments

So far we’ve been able to scrape information from the post results easily, since this information is available on the results pages themselves. But we might also want to scrape comment information, which cannot be accessed from the results pages. We must instead parse the comment page of each individual post, using the URL that we previously extracted in our parsePosts function.

If you take a close look at the HTML source of a comment page such as this one, you’ll see that the comments are located inside a <div> with a sitetable nestedlisting class. Each comment inside this <div> is stored in another <div> with a data-type attribute equal to comment. From there, we can obtain some useful information such as:

  • # of replies: the data-replies attribute
  • author: the <a> tag with the author class inside the <p> tag with the tagline class
  • date: the datetime attribute of the <time> tag inside the <p> tag with the tagline class
  • comment ID: the name attribute of the <a> tag inside the <p> tag with the parent class
  • parent ID: the <a> tag with the data-event-action attribute equal to parent
  • text: the text field of the <div> tag with the md class
  • score: the text field of the <span> tag with the score unvoted class

Let’s create a new function called parseComments and call it from our parsePosts function so that we can get the comment data along with the post data:

scraper.py (v4 - partial)
def parseComments(commentsUrl):
    commentTree = {}
    commentsPage = createSoup(commentsUrl)
    commentsDiv = commentsPage.find('div', {'class':'sitetable nestedlisting'})
    comments = commentsDiv.findAll('div', {'data-type':'comment'})
    for comment in comments:
        numReplies = int(comment['data-replies'])
        tagline = comment.find('p', {'class':'tagline'})
        author = tagline.find('a', {'class':'author'})
        author = "[deleted]" if author == None else author.text
        date = tagline.find('time')['datetime']
        date = datetime.strptime(date[:19], '%Y-%m-%dT%H:%M:%S')
        commentId = comment.find('p', {'class':'parent'}).find('a')['name']
        content = comment.find('div', {'class':'md'}).text.replace('\n','')
        score = comment.find('span', {'class':'score unvoted'})
        score = 0 if score == None else int(re.match(r'[+-]?\d+', score.text).group(0))
        parent = comment.find('a', {'data-event-action':'parent'})
        parentId = parent['href'][1:] if parent != None else ''
        parentId = '' if parentId == commentId else parentId
        commentTree[commentId] = {'author':author, 'reply-to':parentId, 'text':content,
                                  'score':score, 'num-replies':numReplies, 'date':str(date)}
    return commentTree

def parsePosts(posts, product, keyword):
    for post in posts:
        time = post.find('time')['datetime']
        date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')
        title = post.find('a', {'class':'search-title'}).text
        score = post.find('span', {'class':'search-score'}).text
        score = int(re.match(r'[+-]?\d+', score).group(0))
        author = post.find('a', {'class':'author'}).text
        subreddit = post.find('a', {'class':'search-subreddit-link'}).text
        commentsTag = post.find('a', {'class':'search-comments'})
        url = commentsTag['href']
        numComments = int(re.match(r'\d+', commentsTag.text).group(0))
        commentTree = {} if numComments == 0 else parseComments(url)
        product[keyword].append({'title':title, 'url':url, 'date':str(date), 'score':score,
                                 'author':author, 'subreddit':subreddit, 'comments':commentTree})
    return product
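
With this change, every post entry in product.json gains a comments object keyed by comment ID. An illustrative excerpt, again with made-up values:

"comments": {
    "abc1234": {
        "author": "some_user",
        "reply-to": "",
        "text": "Nice write-up!",
        "score": 5,
        "num-replies": 1,
        "date": "2019-05-14 10:02:33"
    }
}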

Scraper v5 - Multiprocessing

Our program is functionally complete at this point. However, it runs rather slowly because all the work is done serially by a single process. We can improve performance by distributing the posts across multiple processes, using the Process and Manager objects from the multiprocessing library.
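
To see how these two objects fit together in isolation, here’s a minimal, self-contained sketch (unrelated to the scraper) in which several processes append to one shared list:

from multiprocessing import Manager, Process

def work(n, shared):
    shared.append(n * n)  # every process appends to the same managed list

if __name__ == '__main__':
    shared = Manager().list()
    jobs = [Process(target=work, args=(i, shared)) for i in range(4)]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    print(list(shared))  # e.g. [0, 1, 4, 9] (order may vary)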

The first thing we need to do is rename the parsePosts function to parsePost and make it handle only a single post. To do that, we simply remove the for loop. We also need to change the function parameters a little: instead of passing our original product object, we’ll pass a list object that the current process appends its result to.

scraper.py (v5 - partial)
def parsePost(post, results):
    time = post.find('time')['datetime']
    date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')
    title = post.find('a', {'class':'search-title'}).text
    score = post.find('span', {'class':'search-score'}).text
    score = int(re.match(r'[+-]?\d+', score).group(0))
    author = post.find('a', {'class':'author'}).text
    subreddit = post.find('a', {'class':'search-subreddit-link'}).text
    commentsTag = post.find('a', {'class':'search-comments'})
    url = commentsTag['href']
    numComments = int(re.match(r'\d+', commentsTag.text).group(0))
    commentTree = {} if numComments == 0 else parseComments(url)
    results.append({'title':title, 'url':url, 'date':str(date), 'score':score,
                    'author':author, 'subreddit':subreddit, 'comments':commentTree})

results is actually a multiprocessing.managers.ListProxy object that we can use to accumulate the output generated by all processes. We’ll later convert it to a regular list and save it in our product. Our main script will now look as follows:

scraper.py (v5 - partial)
# In addition to the earlier imports, this version also needs:
from multiprocessing import Manager, Process

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit == None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date == 'day' or args.date == 'week' or args.date == 'month' or args.date == 'year':
        searchUrl += '&t=' + args.date
    try:
        product = json.load(open('product.json'))
    except FileNotFoundError:
        print('WARNING: Database file not found. Creating a new one...')
        product = {}
    print('Search URL:', searchUrl)
    posts = getSearchResults(searchUrl)
    print('Started scraping', len(posts), 'posts.')
    keyword = args.keyword.replace(' ', '-')
    results = Manager().list()
    jobs = []
    for post in posts:
        job = Process(target=parsePost, args=(post, results))
        jobs.append(job)
        job.start()
    for job in jobs:
        job.join()
    product[keyword] = list(results)
    with open('product.json', 'w', encoding='utf-8') as f:
        json.dump(product, f, indent=4, ensure_ascii=False)

This simple technique alone greatly speeds up the program. For instance, when I performed a search involving 163 posts on my machine, the serial version of the program took 150 seconds to execute, corresponding to approximately 1 post per second. The parallel version, on the other hand, took only 15 seconds (~10 posts per second), which is 10x faster.
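
One possible refinement, which the program above doesn’t do, is to cap the number of worker processes instead of starting one per post. Here’s a rough sketch using multiprocessing.Pool, assuming a hypothetical parsePostsWithoutComments helper that performs the v3-style field extraction and returns the list of post entries without comments:

from multiprocessing import Pool

if __name__ == '__main__':
    # ... build searchUrl, load product.json and collect posts as before ...
    entries = parsePostsWithoutComments(posts)        # hypothetical helper, not part of the article's code
    urls = [entry['url'] for entry in entries]
    with Pool(processes=8) as pool:                   # 8 workers is an arbitrary cap
        commentTrees = pool.map(parseComments, urls)  # parseComments (from v4) takes a comments URL
    for entry, tree in zip(entries, commentTrees):
        entry['comments'] = tree
    product[keyword] = entries

Since the comment URLs are plain strings and the returned comment trees are plain dictionaries, they are cheap to pass between the worker processes and the main one.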

You can check out the complete source code on GitHub. Also, make sure to subscribe to get updates on my future articles.