Building a resume-job matching service
(Part 2)

A personal project in web scraping, ML/NLP, and cloud deployment

Part 2 Summary

I describe my implementation of a web scraper that retrieves job details for a given user query on linkedin.com, including best practices for navigating security and anti-scraping measures.

As a reminder, the full project code can be found in my GitHub repository.

Basics of web scraping

As I discussed in the overview of this project (see Part 1), my first goal is to implement a web scraper that can retrieve job listings from the LinkedIn job board for any given user query. Let's begin with the scraping basics and later delve into how to navigate anti-scraping policies. At its core, my web scraper uses the requests package to send requests to web pages and BeautifulSoup to process the retrieved HTML:


import requests
from bs4 import BeautifulSoup

response = requests.get(url, cookies=cookies, headers=headers, proxies=proxies, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
# do something here with soup to extract the data you want
									

Obtaining the correct URL, headers, and cookies, and setting appropriate proxies, will be the key to successful web scraping. Let's look at each in turn.

Setting the URL

My web scraper uses three URLs: (1) the base job search URL for LinkedIn, (2) the API URL that retrieves more results beyond the base page, and (3) the individual URL of each job listing. The base page URL is accessed with:


job_title = "Data scientist"
location = "Chicago"
post_time = 2.5
base_url = ("https://www.linkedin.com/jobs/search?keywords"
			f"={job_title}"
			f"&location={location}"
			f"&f_TPR=r{int(post_time*86400)}"
			"&position=1&pageNum=0")
									

Here, post_time is a float that describes how far back in days to look, but as you can see from the URL, LinkedIn actually provides second-level granularity (i.e. &f_TPR=r86400 returns jobs posted in the past 24 hours, and &f_TPR=r10 returns jobs posted in the last 10 seconds); with post_time = 2.5 as above, int(post_time*86400) gives 216000, so the parameter becomes &f_TPR=r216000, or the past two and a half days. The base URL only shows the first page of job results, roughly 10-20; to return more results, we need to call the LinkedIn API URL that gets triggered when a user scrolls down the page to load more results. By examining the network traffic on the base URL during scrolling, I found the API URL to be the following:


API_url = ("https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords="
			f"{job_title}"
			f"&location={location}"
			f"&f_TPR=r{int(post_time*86400)}"
			f"&start={start}")										
									

Here, job_title, location, and post_time are the same as before, and start is an int describing the starting position of the retrieval (e.g. start=10 returns job listings starting from the 10th result of the search). By iterating over increasing start values, we can continually retrieve more and more job postings.
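
As a rough sketch, here's what that pagination loop could look like (scrape_jobs_page is a hypothetical placeholder for the card-parsing code shown in the "Putting it all together" section below):


all_jobs = []
start = 0
while start < 200: # arbitrary cap for this sketch
	page_url = ("https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
				f"?keywords={job_title}"
				f"&location={location}"
				f"&f_TPR=r{int(post_time*86400)}"
				f"&start={start}")
	page_jobs = scrape_jobs_page(page_url) # hypothetical helper: requests the URL and parses the job cards
	if not page_jobs:
		break # no more results
	all_jobs.extend(page_jobs)
	start += len(page_jobs)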

The third URL to retrieve is the URL of the job listing itself, which we need to access in order to obtain the full job description. From the original LinkedIn search page, we can only retrieve the job cards, with information like job title, company, and location. I'll go into more detail later about how to find the listing URL.

Getting cookies

Cookies are needed in order to access the API URL from the base page. I wrote a simple function to get cookies from the initial visit to the job search page:


def get_fresh_cookies(base_url: str, user_agent: str):
	session = requests.Session() # Create session to store cookies
	headers = {"User-Agent": user_agent}

	# Make request to url
	response = session.get(base_url, headers=headers)
	if response.status_code != 200:
		return None # problem making the request

	# Extract cookies
	cookies = session.cookies.get_dict()
	return cookies		

from fake_useragent import UserAgent
user_agent = UserAgent().random
cookies = get_fresh_cookies(base_url, user_agent)
									

Here, I use the fake_useragent package to generate a randomized user agent, mimicking a real browser, which helps convince LinkedIn that the scraper is a real user. Here are some example outputs from UserAgent():


'Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.3 Mobile/15E148 Safari/604.1'
'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36'
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
									

Setting the headers

Now, to access the LinkedIn jobs API, we set the request headers. One of them is the Csrf-Token, whose value comes from the JSESSIONID cookie we just obtained.


headers = {
	"User-Agent": user_agent, # our random UserAgent
	"Csrf-Token": cookies["JSESSIONID"],
	"Referer": base_url, # base search URL
	}
									

Setting proxies to avoid detection

Setting proxies is an important part of detection avoidance; they obscure your real IP address and make the API believe it is being accessed from a different one. For small amounts of web scraping this is probably not needed, but making lots of requests in a short amount of time from the same IP could get that IP blocked. I'm using IPRoyal to obtain rotating residential proxies: residential IP addresses that automatically change with each new request. So far, I've found that just 1 GB of bandwidth (about $2-3) covers a few thousand LinkedIn requests, so the cost for this application is very low.

Your proxy provider will give you the proper connection details; in my case, the format looked something like this:


# proxy setup from IPRoyal
proxy = 'geo.iproyal.com:(REDACT)'
proxy_auth = '(REDACT):(REDACT)_country-us'
proxies = {
	'http': f'socks5://{proxy_auth}@{proxy}',
	'https': f'socks5://{proxy_auth}@{proxy}',
}
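
Note that requests only supports the socks5:// scheme if the PySocks extra is installed (pip install requests[socks]). A quick way to sanity-check that traffic is actually routed through the proxy is to hit an IP-echo service and confirm the reported address isn't your own:


# Should print the proxy's exit IP, not your own IP
print(requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10).json())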
									

Putting it all together

To recap from above: I've found the proper URL formats for LinkedIn job searches, retrieved cookies, and set headers and proxies, so we can now use requests.get() to retrieve job data. I've also implemented a few features to avoid detection, namely:

  • Use a random UserAgent for each session
  • Use rotating residential proxies for each new API call (see below)
  • Add a random time interval between calls to avoid "robotic" user behavior (see the sketch after this list).
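
The random delay is the simplest of these; a minimal sketch (the 2-8 second range is just an illustrative choice, not a tuned value):


import random
import time

# Sleep for a random interval between requests to mimic human pacing
time.sleep(random.uniform(2, 8))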

Now to retrieve the job information itself. The scraping process really depends on the website you access, but in the case of the LinkedIn jobs page, here's how it looks (the snippet below lives inside my main scraping function, hence the early return when no jobs are found):


response = requests.get(url, headers=headers, proxies=proxies, cookies=cookies, timeout=10) # make request from API URL
soup = BeautifulSoup(response.text, "html.parser") # extract html
job_cards = soup.find_all("div", class_="base-card") # this gets all the job cards

if not job_cards: 
	return None # no jobs found

# Get job posting data from each posting:
jobs = []
for job in job_cards:
	title_elem = job.find("h3", class_="base-search-card__title")
	company_elem = job.find("h4", class_="base-search-card__subtitle")
	location_elem = job.find("span", class_="job-search-card__location")
	time_elem = job.find("time")
	link_elem = job.find("a", class_="base-card__full-link")  # link to job posting

	title = title_elem.text.strip() if title_elem else "N/A"
	company = company_elem.text.strip() if company_elem else "N/A"
	job_location = location_elem.text.strip() if location_elem else "N/A"
	job_link = link_elem.get("href") if link_elem else "N/A"
	job_time = time_elem.text.strip() if time_elem else "N/A"
	job_des = scrape_job_description_single(job_link) if link_elem else "N/A" # defined below
	jobs.append({"title": title, 
				"company": company, 
				"link": job_link, 
				"posted time": job_time, 
				"description": job_des,
				"location": job_location,
				})
									

To summarize the above: we use BeautifulSoup to find the HTML elements that contain the data we want and extract each one with .text.strip(). Extracting the job description, however, requires a little more effort. Using the link to the job listing, we need to make a new request and extract the description from the returned HTML:

				
def scrape_job_description_single(url: str):
	headers = {"User-Agent": UserAgent().random, # make new random user for each job
				"Referer": "https://linkedin.com",
				}

	try:
		# Make request to full job posting page
		response = requests.get(url, headers=headers, proxies=proxies, timeout=15)

		if response.status_code != 200:
			print(f"Failed to retrieve job details: {response.status_code}")
			return None
		else:
			soup = BeautifulSoup(response.text, "html.parser")
			job_description_elem = soup.find("div", class_="description__text")
			job_des = extract_clean_text(job_description_elem)
			return job_des
	except Exception as e: 
		print(e)
		return None

def extract_clean_text(element):
    # Extracts clean text, with spacing preserved, from a bs4 element
    if element is None:
        return None # no description element was found on the page

    block_tags = {'p', 'li', 'br'}
    texts = []

    for child in element.descendants:
        if child.name == 'br':
            texts.append('\n')
        elif child.name in block_tags:
            texts.append('\n' + child.get_text() + '\n')

    # Join, then normalize newlines/spaces
    combined = ''.join(texts)
    lines = [line.strip() for line in combined.splitlines()]
    result = '\n'.join(line for line in lines if line)

    return result
									

Making the request to the job listing URL is easy, and there's just a bit of processing to do on the retrieved text to form a readable job description, since job descriptions often contain multiple paragraph, list, and line break elements.
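
As a quick illustration of the output, using a made-up HTML snippet rather than real LinkedIn markup:


from bs4 import BeautifulSoup

html = "<div><p>About the role</p><ul><li>Build ML models</li><li>Deploy to the cloud</li></ul></div>"
elem = BeautifulSoup(html, "html.parser").find("div")
print(extract_clean_text(elem))
# About the role
# Build ML models
# Deploy to the cloud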

Summary

What have we done so far? We used requests to call the LinkedIn API after properly setting up the cookies, headers, and proxies for the request; used BeautifulSoup to find specific HTML classes and extract information like the job title, company, and location; and finally wrote a small function to extract the full job description from the job listing's link.

With this method we can quickly scrape hundreds of job listings from LinkedIn based on the user's query (job title, location, posted date). How can we sort through all this job data to find the most relevant jobs for the user?

Let's look at job-resume matching in Part 3