Ruby Gem Nokogiri XPath Example: Crawling Mashable And Techcrunch
In the past few days, I’ve been working on a web crawler for various WordPress websites. My goal was to get the title, author, date, and link for every single blog post on Mashable and Techcrunch. It took me a few tries of playing around with XPath to get the awesome Ruby Nokogiri gem to work, so I wanted to share my work with others. Here is how I crawled Mashable and Techcrunch:
You can also view these examples on GitHub.
Mashable
=begin
  The Mashable crawler crawls Mashable and gets the
  Title, URL, Date, and Author for each post
=end

require 'open-uri'
require 'nokogiri'
require 'date'
require 'cgi'

private

def get_post_details(doc)
  # an array of hashes with each post's title, url, date, author
  posts = []

  doc.css('div.post_content').each do |post|
    details = post.xpath('.//a')[0].attributes

    # get url
    url = details['href'].value
    # verify that the url matches mashable.com and is not feedproxy.google.com
    next unless url =~ /mashable\.com/
    puts url

    # get and parse title (skip posts without a title attribute)
    next if details['title'].nil?
    title = details['title'].value
    title = title.gsub(/Permanent Link to /, '')
    title = CGI.escape(title) # make sure title is encoded
    puts title

    # get published date
    next if post.xpath('.//time')[0].nil? # avoid 'sponsored posts'
    date_details = post.xpath('.//time')[0].attributes
    date_string = date_details['datetime'].value
    date = DateTime.parse(date_string) # creates a date/time object
    puts date.to_s

    # get author
    author_details = post.xpath('.//span')[0].children
    author = author_details.children.inner_text
    puts author

    # store the data in a hash
    posts << { 'title' => title, 'date' => date, 'author' => author, 'url' => url }
  end

  posts # returns an array of hashes with details for each post
end

public

def crawl_mashable
  # keeps track of page number
  page = 1

  # loop through each page on Mashable, and get the author, title, url, and date for each post
  loop do
    # get post details per page
    url = "http://mashable.com/page/#{page}/"
    doc = Nokogiri::HTML(URI.open(url))
    posts = get_post_details(doc)
    p posts

    # you've crawled all the pages if the returned posts array is empty
    break if posts.empty?

    # move on to the next page
    page += 1
  end
end
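The title and date massaging in `get_post_details` can be checked in isolation with just the standard library. The sample strings below are invented to mimic what Mashable's markup provided at the time:

```ruby
require 'cgi'
require 'date'

# Strip the "Permanent Link to " prefix and URL-encode the rest,
# exactly as the crawler does (CGI.escape encodes spaces as '+').
raw_title = 'Permanent Link to 10 Great Ruby Gems'
title = raw_title.gsub(/Permanent Link to /, '')
title = CGI.escape(title)
puts title # => "10+Great+Ruby+Gems"

# Parse the ISO 8601 string from the <time> element's datetime attribute
date = DateTime.parse('2012-01-14T12:30:00+00:00')
puts date.to_s # => "2012-01-14T12:30:00+00:00"
```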
Techcrunch
=begin
  The Techcrunch crawler crawls Techcrunch and gets the
  Title, URL, Date, and Author for each post
=end

require 'open-uri'
require 'nokogiri'
require 'date'
require 'cgi'

def get_post_details(doc)
  # get title and url
  title_and_url = []

  doc.css('h2.headline').each do |headline|
    details = headline.xpath('.//a')[0].attributes

    # url encoded
    url = CGI.escape(details['href'].value)

    # title parsing
    title = details['title'].value
    title = title.gsub(/"/, "'")
    title = CGI.escape(title)

    title_and_url << { 'title' => title, 'url' => url }
  end

  # get author and date
  author_and_date = []

  doc.css('div.publication-info').each do |post|
    # author
    author_details = post.xpath('.//div')[0].children
    begin
      author = author_details.children[0].content
      author = CGI.escape(author)
    rescue StandardError
      author = author_details[0].content
      author = CGI.escape(author)
    end

    # post date / time
    date_details = post.xpath('.//div')[1].children
    date_string = date_details[0].content
    begin
      date = DateTime.parse(date_string)
    rescue StandardError
      date = DateTime.now - 1 # fall back to yesterday when parsing fails
    end

    author_and_date << { 'author' => author, 'date' => date }
  end

  # combine the title and url with author and date
  posts = []
  title_and_url.each.with_index do |post, i|
    post.store('publication_id', '4f0f8f4b9ff088c9a1000002')
    post.update(author_and_date[i])
    posts << post
  end

  posts # returns an array of hashes with details for each post
end

public

def crawl_techcrunch
  # keeps track of Techcrunch page number
  page = 1

  # loop through each page on Techcrunch and get the author, title, url, and date for each post
  loop do
    # get post details per page
    url = "http://techcrunch.com/page/#{page}/"
    doc = Nokogiri::HTML(URI.open(url))
    posts = get_post_details(doc)
    p posts

    # you've crawled all the pages if the returned posts array is empty
    break if posts.empty?

    # move on to the next page
    page += 1
  end
end
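The merge step at the end of the Techcrunch `get_post_details` pairs each title/url hash with the author/date hash at the same index, since the two arrays come from parallel passes over the same page. A sketch on invented dummy data (the hash contents and values are made up for illustration):

```ruby
# Two parallel arrays, one hash per post in each, same order
title_and_url   = [{ 'title' => 'Post+A', 'url' => 'http%3A%2F%2Fexample.com%2Fa' }]
author_and_date = [{ 'author' => 'Jane', 'date' => '2012-01-14' }]

# Pair hashes by index; Hash#merge returns a new combined hash
posts = title_and_url.each.with_index.map do |post, i|
  post.merge(author_and_date[i])
end

p posts
```

Using `Hash#merge` here avoids mutating the source arrays; the crawler itself uses the in-place `update`/`store` variant, which is equivalent for this purpose.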
Let me know in the comments if there is a better or simpler way to do any of this 🙂