Hello all,
I have a task of archiving web pages to extract data, specifically product
ratings (books, music, food, etc.) given by users.
The pages come from different websites, so I understand that I need to
process each site's pages in a different way, mostly with XML processing
and regular expressions (any expert advice here is welcome). I have been
reading about Nutch as a crawler for this task, but I have also read that
Nutch can be overkill for simple tasks.
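To make the per-site extraction concrete, here is the kind of thing I have
in mind, just a rough regexp sketch in Java; the markup and the pattern
are made up for illustration, and every real site would need its own
pattern:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RatingExtractor {
        // Hypothetical markup; each site would need its own pattern.
        private static final Pattern RATING =
            Pattern.compile("<span class=\"rating\">\\s*(\\d+(?:\\.\\d+)?)\\s*/\\s*5\\s*</span>");

        public static void main(String[] args) {
            String html = "<div><span class=\"rating\">4.5 / 5</span></div>";
            Matcher m = RATING.matcher(html);
            if (m.find()) {
                System.out.println("rating = " + m.group(1));  // prints: rating = 4.5
            }
        }
    }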

I tried the tutorial on the wiki,
http://wiki.apache.org/nutch/NutchTutorial, which produced many files but
no source for the pages. From
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread
I found that I need to read each segment separately to retrieve the
HTML/XHTML.
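Instead of going through bin/nutch segread, what I have been sketching is
reading a segment's content data files directly with the Hadoop
SequenceFile API. This is only a rough sketch based on my reading of the
0.8 layout; the content/part-00000/data path and the key/value classes are
assumptions on my part and may not match the version exactly:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.apache.nutch.protocol.Content;

    public class DumpSegmentContent {
        public static void main(String[] args) throws Exception {
            // args[0]: path to one segment's content data file, e.g.
            //   crawl/segments/<timestamp>/content/part-00000/data
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path data = new Path(args[0]);

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            while (reader.next(key, value)) {
                Content content = (Content) value;  // raw fetched page plus metadata
                System.out.println("URL: " + key);
                System.out.println(new String(content.getContent(), "UTF-8"));  // the html/xhtml bytes
            }
            reader.close();
        }
    }

If this is roughly right, each fetched page should come back as a Content
record holding the raw bytes, which I could then feed to the per-site
extractor above.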

Now that I notice this simple task is growing, I have started wondering
whether Nutch is what I need or whether there is something I am missing.

I did some reading about screen scraping, and it looks like what I want.
However, since I am going to work on a project where crawling the web and
searching its contents is a requirement, I decided to go with Nutch so I
can reuse the skills I acquire.

In other words, I prefer to adopt a framework that I can use now for
simple tasks and later for larger projects. What I would like to know is:
can Nutch help me crawl multiple sites and store the pages so that I can
extract data from them?


Thank you in advance.
