Scraping the Online Job Posting Data: 'Indeed.com'

P.S. This is still a work in progress.

I am scraping the search results for job postings that require skills related to artificial intelligence, e.g., experience with and knowledge of machine learning, linear algebra, data analytics, etc. The search returns 11,807 pages, each listing ten job postings (approximately 118,070 job openings for A.I. alone in the United States as of October 2, 2019).
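As a quick check on the arithmetic above, the sketch below computes the approximate number of postings and the result offsets that the paginated search URLs would need. It assumes that the `start` query parameter is a zero-based result offset that advances by ten per page; `postings_per_page`, `page_offsets`, and `page_urls` are names introduced here for illustration. Integers are used on purpose: R formats the double `100000` as `1e+05` when pasting it into a string, which would produce a malformed URL.

```r
num_pages <- 11807L          # number of result pages reported above
postings_per_page <- 10L     # ten postings per page

# Approximate total number of postings: 11,807 * 10 = 118,070.
total_postings <- num_pages * postings_per_page

# Assumption: "start" is a zero-based result offset (0, 10, 20, ...).
# Integer arithmetic keeps large offsets out of scientific notation.
page_offsets <- (seq_len(num_pages) - 1L) * postings_per_page

base_url <- "https://www.indeed.com/jobs?q=artificial%20intelligence"
page_urls <- paste0(base_url, "&start=", page_offsets)

print(total_postings)
print(head(page_urls, 3))
```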

Terminal and tmux:

I run the scraping process on a remote server.

ssh <account_ID>@<Server_IP_Address>  ## Connect to a server
tmux attach -t <Session Number>  ## Attach a tmux session
R  ## Access R

Loading libraries in R:

library(data.table)
library(tidyverse)
library(rvest)
library(xml2)
library(stringi)

Checking the total Number of Pages:

num_pages <- 11807 # last page of results
full_data <- vector("list")

get_job_postings <- function(start){
  # Indeed's "start" parameter is a result offset (0, 10, 20, ...) rather than a page number.
  # as.integer() keeps large offsets such as 100000 from being pasted as "1e+05".
  url <- paste0("https://www.indeed.com/jobs?q=artificial%20intelligence", 
                "&start=", as.integer(start))  ## Can be adjusted depending on which search results are to be scraped.
  page <- xml2::read_html(url)
  href_links <- page %>% 
    html_nodes("div") %>%
    html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
    html_attr("href") %>% paste0("https://indeed.com",.)  
  href_links  # Return the links to the individual job postings on this page.
}
for(i in 1:num_pages){
  # Page i starts at result offset (i - 1) * 10, with ten postings per page.
  href_links <- get_job_postings((i - 1L) * 10L)
  
  subdata <- data.table()
  for(j in seq_along(href_links)) {
    page2 <- xml2::read_html(href_links[j])
  
    Sys.sleep(2.5) # To prevent the platform from blocking us.
    
    job_title <- page2 %>%  # [1] Job Title
      html_nodes(xpath = '//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none  jobsearch-JobInfoHeader-title"]') %>%
      html_text() %>%
      stri_trim_both()
    
    company_name <- page2 %>%  # [2] Company Name
      html_nodes(xpath = '//*[@class="jobsearch-CompanyAvatar-companyLink"]') %>%
      html_text() %>%
      paste0(., "")
    
    job_location <- page2 %>%  # [3] Job Location
      html_nodes(xpath = '//*[@class=" jobsearch-CompanyInfoWithoutHeaderImage jobsearch-CompanyInfoWithReview"]') %>%
      html_text() %>%
      paste0(., "")
    
    company_rating <- page2 %>%  # [4] Rating
      html_nodes("a") %>%
      html_nodes(xpath = '//*[@class="icl-Ratings-starsCountWrapper icl-Ratings-link"]') %>%
      html_attr("aria-label") %>%
      paste0(., "")
    
    company_review <- page2 %>%  # [5] Number of Reviews
      html_nodes(xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[2]/div/a/div[2]') %>%
      html_text() %>%
      paste0(., "")
    
    job_description <- page2 %>%  # [6] Job Description (scraped here, but not stored in the table below)
      html_nodes(xpath = '//*[@id="jobDescriptionText"]') %>% 
      html_text() %>%
      stri_trim_both()
    
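    # Append one row per posting. (purrr::safely() wraps a function, not a value,
    # so it cannot be applied to the data.table directly.)
    subdata <- rbind(subdata, data.table(date_crawled = Sys.time(),
                                         job_title,
                                         company_name,
                                         job_location,
                                         company_rating,
                                         company_review,
                                         href_links = href_links[j]))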
    
    # print(paste0("For-loop in for-loop #", j,"."))
  }
  full_data[[i]] <- subdata
  print(paste0("Iteration ", i," completed."))
}
print("Iteration completed.")
full_data = rbindlist(full_data)
write.csv(x = full_data, file = "/mnt/admin_jaewon_02/AI_job_posting_data/AI_job_posting_data.csv")
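
A crawl this long will occasionally hit a failed request, and a single error inside the loop stops the whole run. One way to guard against that, sketched below under stated assumptions, is to wrap the download in `tryCatch()` and retry a few times before giving up. `fetch_with_retry`, `max_tries`, and `wait_seconds` are hypothetical names introduced here, and `fetch` stands in for any function that downloads a page (for example `xml2::read_html`); none of this is part of the original script.

```r
# Hypothetical helper: call fetch(url), retrying when it raises an error.
# Returns NULL if every attempt fails, so the caller can skip that URL.
fetch_with_retry <- function(url, fetch, max_tries = 3, wait_seconds = 2.5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      fetch(url),
      error = function(e) {
        message("Attempt ", attempt, " failed for ", url, ": ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    Sys.sleep(wait_seconds)  # pause before the next attempt
  }
  NULL
}

# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls <- 0
flaky_fetch <- function(url) {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  paste("contents of", url)
}

page <- fetch_with_retry("https://example.com", flaky_fetch, wait_seconds = 0)
print(page)
```

In the inner loop above, this could be used as `page2 <- fetch_with_retry(href_links[j], xml2::read_html)` followed by `if (is.null(page2)) next`, so that a posting that cannot be downloaded is skipped instead of ending the run.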

What the Data Looks Like:

| id | date_crawled | job_title | company_name | job_location | company_rating | company_review | href_links |
|----|--------------|-----------|--------------|--------------|----------------|----------------|------------|
| 1 | 10/2/19 13:05 | VP of Artificial Intelligence | Samsung SDS America | Samsung SDS America6,867 reviews-San Jose, CA 95134 | 4 out of 5 | 6,867 reviews | https://indeed.com/rc/clk?jk=29e6b0cfd4f1ad6e&fccid=da3c7fed78dd1607&vjs=3 |
| 2 | 10/2/19 13:05 | Artificial Intelligence Solution Architect | Avanade | Avanade243 reviews-Baltimore, MD | 3.7 out of 5 | 243 reviews | https://indeed.com/rc/clk?jk=c8b32b06ac7c5f37&fccid=5386281035076fdf&vjs=3 |
| 3 | 10/2/19 13:05 | Artificial Intelligence Intern | Tractor Supply Company | Tractor Supply Company3,297 reviews-Brentwood, TN 37027 | 3.5 out of 5 | 3,297 reviews | https://indeed.com/rc/clk?jk=4f9e3676b2e5e4c8&fccid=11196309d222f1c1&vjs=3 |
| 4 | 10/2/19 13:05 | Associate - Artificial Intelligence & Semantics | Morgan Stanley | Morgan Stanley3,506 reviews-Maryland | 3.9 out of 5 | 3,506 reviews | https://indeed.com/rc/clk?jk=63e1414a8823a284&fccid=0c39fb2c91742dcf&vjs=3 |
| 5 | 10/2/19 13:06 | Backend Software Engineer, Artificial Intelligence | IBM | IBM27,074 reviews-Cambridge, MA 02139 | 3.9 out of 5 | 27,074 reviews | https://indeed.com/rc/clk?jk=c6bd906a6276d2d3&fccid=de71a49b535e21cb&vjs=3 |
| 6 | 10/2/19 13:06 | Conversational Artificial Intelligence Designer | PennyMac Loan Services, LLC | PennyMac Loan Services, LLC267 reviews-Plano, TX | 3.1 out of 5 | 267 reviews | https://indeed.com/rc/clk?jk=371dc494bc960265&fccid=24c6c21cc329dea7&vjs=3 |
| 7 | 10/2/19 13:06 | Artificial Intelligence | Bank of America | Bank of America28,108 reviews-Charlotte, NC 28255 | 3.8 out of 5 | 28,108 reviews | https://indeed.com/rc/clk?jk=8b653788c51d5ef9&fccid=5bd99dfa21c8a490&vjs=3 |
| 8 | 10/2/19 13:06 | Intern - Artificial Intelligence (AI) | Alion Science and Technology | Alion Science and Technology227 reviews-College Park, MD | 3.5 out of 5 | 227 reviews | https://indeed.com/rc/clk?jk=2638fb51ee02f970&fccid=1f295927bec6a974&vjs=3 |
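
In the rows above, `job_location` still carries the company name and review count in front of the actual location, and `company_rating` and `company_review` are stored as text. Below is a minimal cleaning sketch, assuming the patterns seen in these sample rows hold for the rest of the data. The `clean_*` names and the regular expressions are illustrative and not part of the original script.

```r
# Two sample rows copied from the table above.
sample_rows <- data.frame(
  company_name   = c("Samsung SDS America", "Morgan Stanley"),
  job_location   = c("Samsung SDS America6,867 reviews-San Jose, CA 95134",
                     "Morgan Stanley3,506 reviews-Maryland"),
  company_rating = c("4 out of 5", "3.9 out of 5"),
  company_review = c("6,867 reviews", "3,506 reviews"),
  stringsAsFactors = FALSE
)

# Drop everything up to and including "reviews-" to keep only the location.
# Rows without that prefix are left unchanged.
clean_location <- sub("^.*reviews-", "", sample_rows$job_location)

# "3.9 out of 5" -> 3.9
clean_rating <- as.numeric(sub(" out of 5$", "", sample_rows$company_rating))

# "6,867 reviews" -> 6867
clean_reviews <- as.integer(gsub("[^0-9]", "", sample_rows$company_review))

print(data.frame(clean_location, clean_rating, clean_reviews))
```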

Sample Job Description (Row 1: VP of Artificial Intelligence, Samsung SDS America):

“Vice President of Artificial Intelligence is an executive position for a result oriented R&D leader. He or she will lead R&D efforts of Artificial Intelligence Team (AIT) within Samsung SDS Research America (SDSRA) located in Silicon Valley. He or she will be responsible for creating the overall AI roadmap for AIT and building a cohesive and comprehensive AI strategy for execution as well as providing technical guidance.

AIT in SDSRA has been building AI productivity enhancement platform for the last two years and will be launching it early next year. The core mission of this position is the oversight of product development efforts for current AI platform as well as the incubation of new AI products. This role requires a close collaboration with Artificial Intelligence Research Center in Samsung SDS in Korea from an engineering perspective and with Samsung SDS America from going to market perspective…”

Created on Oct. 03, 2019


