Re: [Rails] How to scrape a page without knowing its html structure
On Sat, Dec 12, 2009 at 2:56 AM, kalyan <kalyan.allampalli@gmail.com> wrote:
> I'm building a module for my site that needs to import a user's blog.
> I can read the blog through its RSS feed, but the feed doesn't give me
> the entire content, so I need to scrape the user's blog page. How can
> I scrape a page without knowing its HTML structure?
Unless you want the entire page, you need to know something about
the page structure.
Well. If the page is even reasonably marked up (DIVs/Ps-wise) and
you create an array of block elements, you *might* get away with the
assumption that the ones with significant amounts of text (for some
value of "significant") are the actual blog post.
Might. I'd imagine a lot more going into that heuristic, since you're
looking for an AI solution :-)
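For what it's worth, here is a rough sketch of that heuristic using only
the Ruby standard library. The method names (text_blocks,
likely_post_blocks), the list of block tags, and the 200-character
threshold are all assumptions made up for illustration. The regexes are
crude; a real HTML parser such as Nokogiri would cope far better with
messy markup.

```ruby
# Tags treated as block boundaries (an assumed, incomplete list).
BLOCK_TAGS = %w[div p td li blockquote article section pre h1 h2 h3 h4 h5 h6].freeze
BLOCK_BOUNDARY = %r{</?(?:#{BLOCK_TAGS.join('|')})\b[^>]*>}i

# Split the page at block-level tag boundaries and return the plain-text
# runs found between them, with inline tags and extra whitespace removed.
def text_blocks(html)
  cleaned = html.gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop scripts and styles
  cleaned = cleaned.gsub(/<!--.*?-->/m, ' ')                # drop HTML comments
  cleaned.split(BLOCK_BOUNDARY).map do |chunk|
    chunk.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip       # strip inline tags
  end.reject(&:empty?)
end

# Keep only the runs with a "significant" amount of text.
# min_chars is an arbitrary guess and would need tuning per site.
def likely_post_blocks(html, min_chars: 200)
  text_blocks(html).select { |text| text.length >= min_chars }
end
```

Calling likely_post_blocks(page_html) on a fetched page returns an array
of strings, one per long text run. Navigation links, sidebars, and short
comments tend to fall under the threshold, but so will short posts, and
HTML entities are left undecoded, so treat the result as a starting
point only.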
Good luck,
--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com
twitter: @hassan
--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.