Lewis, thank you for your time. Since I am new to web crawling, I didn't include enough info to help you answer. I will try to give more details. I need to start crawling a website, beginning with its start page. That page has a form that submits a category chosen from a list of product categories (<option> elements). The results for each category span multiple pages, and each page contains a list of products. You can see the same pattern when you search Google: it shows the search results plus more pages for additional results.
I need to parse each of these result pages. I haven't done this with a crawler before, but I can use regex or (x)html processing to extract the data I need (it can be an external program), so that part is not an issue. The issue is saving the page contents and locating them afterwards so they are available for parsing and data extraction. I tried running Nutch against an example site, which created the files under the "crawl" directory. The tutorial I was following is http://wiki.apache.org/nutch/NutchTutorial, which is for Nutch 1.x. Running "nutch readseg -dump" on each segment doesn't give me the (x)html page as is, and that is the problem (roughly the invocation I tried is at the bottom of this mail, below the quoted text). I don't need to generate indexes for Solr, since I am not going to search those pages. I hope this explains it better. Thank you.

On Thu, Apr 5, 2012 at 5:56 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Mansour,
>
> On Wed, Apr 4, 2012 at 10:05 PM, Mansour Al Akeel <[email protected]> wrote:
>
>> I understand that I need to implement a way to process each of the
>> pages for these sites in a different way. Mostly XML processing and
>> regexp (any expert advice here).
>
> This is extremely vague, what kind of processing, what are you actually
> aiming to do?
>
>> However I have been reading about nutch as crawler to be used for this
>> task. I read that nutch can be overkill for simple tasks.
>
> I suppose you will soon find out.... Nutch is in my opinion THE leading
> open source (web) crawler. If you don't want to be doing activities that
> are classed under the umbrella topic of crawling, then don't use it.
>
>> I tried the tutorial on the wiki site,
>> http://wiki.apache.org/nutch/NutchTutorial which resulted in many
>> files, with no source for the pages.
>
> This again is very vague. As I don't know what you're actually trying to
> do it makes the task of providing an answer slightly difficult. Are you
> trying to search these pages, classify them, run a custom program on some
> raw (x)html or some other source?
>
>> I found that I need to read each segment separately to retrieve the
>> html/xhtml from
>> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread
>
> Things have moved on considerably from Nutch 0.8... try here
> http://wiki.apache.org/nutch/CommandLineOptions
> and if what you are after is not in there then maybe we can discuss some
> other method.
>
>> What I would like to know is, can nutch help in crawling multiple sites
>> and store the pages to extract data from?
>
> Yes of course. Try looking at various directories produced as a result
> of a small crawl. Once you have an idea of these you'll get a better
> picture of what you can do processing-wise.
>
> hth a bit
>
> --
> *Lewis*
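P.S. For reference, the closest I have gotten is roughly the following (the segment name and output directory below are only examples from my test crawl; yours will differ):

    bin/nutch readseg -dump crawl/segments/20120405123456 dump_out \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext

As far as I understand, the -no* switches limit the dump to the raw content records, but the result is still a single text file (dump_out/dump) with "Recno::" / "Content::" headers wrapped around each page, rather than the individual (x)html files I was hoping for.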

