Saturday, December 12, 2009

Re: [Rails] How to scrape a page without knowing its html structure

On Sat, Dec 12, 2009 at 2:56 AM, kalyan <kalyan.allampalli@gmail.com> wrote:

> I'm building a module for my site that needs to import a user's blog.
> I can read the blog through its RSS feed, but the feed doesn't give me
> the entire content, so I need to scrape the user's blog page. How can I
> scrape a page without knowing its HTML structure?

Unless you want the entire page, you need to know something about
the page structure.

Well. If the page is even reasonably marked up (in terms of DIVs and
Ps) and you create an array of its block elements, you *might* get away
with the assumption that the ones with significant amounts of text (for
some value of "significant") are the actual blog post.

Might. I'd imagine a lot more going into that heuristic, since you're
looking for an AI solution :-)
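
To make the idea concrete, here is a rough sketch of that heuristic in
plain Ruby. It is only an illustration under stated assumptions: the
names (text_blocks, likely_post_blocks, MIN_CHARS) and the 80-character
cutoff are made up for this example, and splitting markup with regular
expressions is a crude stand-in for a real HTML parser such as Nokogiri.

```ruby
# Hypothetical threshold for "significant" text; tune it per site.
MIN_CHARS = 80

# Opening or closing <div>/<p> tags mark the edges of a block.
BLOCK_BOUNDARY = %r{</?(?:div|p)\b[^>]*>}i

# Script and style bodies are not visible text, so drop them first.
NON_CONTENT = %r{<(script|style)\b.*?</\1>}mi

# Returns the visible text runs found between block boundaries.
def text_blocks(html)
  cleaned = html.gsub(NON_CONTENT, " ")
  chunks = cleaned.split(BLOCK_BOUNDARY).map do |chunk|
    without_tags = chunk.gsub(/<[^>]+>/, " ") # strip inline tags
    without_tags.gsub(/\s+/, " ").strip       # collapse whitespace
  end
  chunks.reject(&:empty?)
end

# Keeps only the blocks long enough to plausibly be post content.
def likely_post_blocks(html, min_chars = MIN_CHARS)
  text_blocks(html).select { |text| text.length >= min_chars }
end
```

Calling likely_post_blocks(page_html) returns an array of candidate
text blocks. Short runs such as navigation links and footers fall below
the cutoff, but long sidebars or comment threads would still slip
through, which is where the "lot more" in the heuristic comes in.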

Good luck,
--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com
twitter: @hassan
