RE: [Rails] How to scrape a page without knowing its html structure
I think you'll find you need to know _something_ about the page layout. If
there is a finite number of sites you need to scrape, you could do this
pretty simply.
Assume you have a CSS selector that finds the desired content for each URL
of interest, and that it is stored in an ActiveRecord(ish) model.
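# ...
require 'nokogiri'
require 'open-uri'

# Look up the stored selector record for this URL.
# `css` is an assumed column name holding the selector string.
@selector = Selector.find_by_url(@the_url_to_scrape)
doc = Nokogiri::HTML(URI.open(@the_url_to_scrape))

# Search for nodes by CSS and print the text of each match
doc.css(@selector.css).each do |node|
  puts node.content
end
# ...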
I did a write-up on simple scraping with Nokogiri and SelectorGadget here:
http://joemcglynn.wordpress.com/2009/12/10/five-minute-introduction-to-nokogiri/