Hi Mansour,

On Wed, Apr 4, 2012 at 10:05 PM, Mansour Al Akeel <[email protected]
> wrote:

> I understand that I need to
> implement a way to process each of the pages for these sites in a
> different way. Mostly XML processing and regexp (any expert advice
> here).


This is extremely vague. What kind of processing, and what are you actually
aiming to do?


> However I have been reading about nutch as
> crawler to be used for this task. I read that nutch can be an over
> kill for simple tasks.
>

I suppose you will soon find out... Nutch is, in my opinion, THE leading
open source (web) crawler. If you don't want to be doing activities that
fall under the umbrella of crawling, then don't use it.


>
> I tried the tutorial on the wiki site,
> http://wiki.apache.org/nutch/NutchTutorial which resulted in a many
> files, with no
> source for the pages.


This again is very vague. As I don't know what you're actually trying to do,
it makes the task of providing an answer slightly difficult. Are you
trying to search these pages, classify them, or run a custom program on some
raw (X)HTML or other source?


> I found that I need to read each segment
> separately to retrieve the html/xhtml from
>  http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread
>

Things have moved on considerably from Nutch 0.8... try here:
http://wiki.apache.org/nutch/CommandLineOptions
If what you are after is not in there, then maybe we can discuss some
other method.
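
If the raw fetched content is what you're after, the segment reader is the
usual route in the 1.x releases. Something along these lines should do it
(the segment name below is just an example, use whatever timestamped
directory your own crawl produced, and double check the flags against your
version):

  bin/nutch readseg -dump crawl/segments/20120404123456 dump_out \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext

That should leave a plain-text dump of the fetched content under dump_out,
which you can then feed to your own XML/regexp processing.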

>
> What I
> like to know
> is, can nutch help in crawling multiple sites and store the pages to
> extract data from ?
>

Yes, of course. Try looking at the various directories produced as a result
of a small crawl. Once you have an idea of these you'll get a better picture
of what you can do processing-wise.
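
For example (paths and flag values are illustrative, adjust to taste):

  bin/nutch crawl urls -dir crawl -depth 2 -topN 50
  bin/nutch readdb crawl/crawldb -stats

Afterwards have a look at crawl/crawldb, crawl/linkdb and the timestamped
directories under crawl/segments.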

Hope that helps a bit

-- 
*Lewis*
