Hello all, I have a task of archiving web pages in order to extract data from them. The data is product ratings (books, music, food, etc.) given by users, and the pages come from different websites, so I understand that I will need to process each site's pages in a different way, mostly with XML processing and regular expressions (any expert advice here is welcome; a rough sketch of what I have in mind is below). However, I have been reading about Nutch as a crawler for this task, and I have also read that Nutch can be overkill for simple tasks.
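To make the question more concrete, here is a rough sketch (in Java, since that is what Nutch uses) of the kind of per-site extraction I have in mind. The <span class="rating"> markup and the RatingExtractor name are just made up for illustration; each site would need its own pattern, or an XPath expression where the pages are valid XHTML.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatingExtractor {

    // Hypothetical markup: <span class="rating">4.5</span>
    // Each site would get its own pattern (or XPath expression).
    private static final Pattern RATING =
        Pattern.compile("<span class=\"rating\">(\\d+(?:\\.\\d+)?)</span>");

    /** Returns the first rating found in the page source, or null if none. */
    public static String extractRating(String html) {
        Matcher m = RATING.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<div><span class=\"rating\">4.5</span></div>";
        System.out.println(extractRating(page)); // prints 4.5
    }
}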
I tried the tutorial on the wiki, http://wiki.apache.org/nutch/NutchTutorial, which resulted in many files but no page source. From http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread I found that I need to read each segment separately to retrieve the HTML/XHTML (I have pasted a rough sketch of how I imagine doing that at the end of this message). Now that this seemingly simple task is growing, I am starting to wonder whether Nutch is what I need or whether I am missing something. I did some reading about screen scraping, and it looks like what I want.

That said, I am going to work on a project where crawling the web and searching its contents is a requirement, so I decided to go with Nutch in order to reuse the skills I will acquire. In other words, I would prefer a framework that can be used now for simple tasks and later for larger projects. What I would like to know is: can Nutch help me crawl multiple sites and store the pages so that I can extract data from them? Thank you in advance.
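Here is the segment-reading sketch I mentioned above. It is only my guess at how to get the raw page content back out of a segment programmatically, based on the Nutch/Hadoop APIs as I understand them; the segment path layout and the SequenceFile.Reader constructor may differ between versions, so please correct me if this is the wrong approach.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentDumper {

    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);

        // Assumed layout: the fetched page content lives under
        // <segment>/content/part-00000/data (a Hadoop MapFile).
        Path data = new Path(args[0], "content/part-00000/data");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();

        while (reader.next(url, content)) {
            // content.getContent() is the raw page as bytes;
            // this is where the per-site extraction would plug in.
            System.out.println(url + "\t" + content.getContentType());
        }
        reader.close();
    }
}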

