Re: Nutch 2.1 - fetching is not working (maybe broken generate?)

anupamk Sun, 16 Mar 2014 11:00:16 -0700

I'm sorry for causing confusion.

The Nutch 2.x fetch cycle is implemented quite differently from the
Nutch 1.x one.

The idea is pretty much the same though.

So, the way to check whether each step works is as follows:

1. Use generatorjob to generate a batch of topN = 100 links to fetch
from your webtable.
2. Check your webtable and see which links have a generator mark -- if
100 links have the generator mark, then your generate job is working.
3. Use fetcherjob to fetch -- once the job has completed, see which links in
the webtable have a fetched mark.
4. Use parserjob to parse -- then check whether the webtable was updated
correctly after the job.
5. Use the updateDB job to update the webtable -- then check whether the
webtable was updated correctly.
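
To make the "run a step, then count the marks" idea concrete, here is a
minimal sketch in Python. It is a toy in-memory model, not Nutch or Gora
code: the Row class, the mark names ("generate", "fetch", "parse",
"update") and the helper functions are all made up for illustration, and
the real marker names and webtable layout in Nutch 2.x may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """One toy webtable row: a URL plus the set of marks it carries."""
    url: str
    marks: set = field(default_factory=set)


def count_marked(table, mark):
    """Return how many rows carry the given mark."""
    return sum(1 for row in table if mark in row.marks)


def generate(table, top_n):
    """Mark up to top_n rows that are not yet in a batch; return the batch size."""
    batch = [row for row in table if "generate" not in row.marks][:top_n]
    for row in batch:
        row.marks.add("generate")
    return len(batch)


def run_step(table, required, new_mark):
    """Add new_mark to every row that already carries the required mark."""
    for row in table:
        if required in row.marks:
            row.marks.add(new_mark)


if __name__ == "__main__":
    # Hypothetical table of 250 URLs; only 100 should end up in the batch.
    table = [Row(f"http://example.com/page{i}") for i in range(250)]

    generate(table, top_n=100)
    print("generate:", count_marked(table, "generate"))

    # Each later step only touches rows that passed the previous one.
    for required, new_mark in [("generate", "fetch"),
                               ("fetch", "parse"),
                               ("parse", "update")]:
        run_step(table, required, new_mark)
        print(new_mark + ":", count_marked(table, new_mark))
```

In this toy model every count comes out as 100. Against a real webtable,
the first step whose count drops below the batch size is the one to
investigate.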

You get the idea, right?

In Nutch 1.x we don't use Gora or have a webtable, and every operation relies
on file operations -- that's the reason you run readseg, create segments, etc.

The whole segment concept is no longer there in Nutch 2.x because the
webtable in Gora takes its place.

However, since I have only worked on Nutch 1.x (and still do), I can't help
you with how to check the webpage table for generator/fetcher/parser markers,
etc.

But I am sure you will be able to find documentation in the wiki on the
commands for doing it.

Here are a few good places to get started:

http://wiki.apache.org/nutch/Nutch2Crawling
http://nlp.solutions.asia/?p=232

--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-2-1-fetching-is-not-working-maybe-broken-generate-tp4123813p4124590.html
Sent from the Nutch - User mailing list archive at Nabble.com.
