Hi Tony,

As I remember, some phases in Nutch (INJECT, GENERATE, ...) set a specific
mark (marker field): for example, the INJECT phase sets "mk:_injmrk_" and the
GENERATE phase sets "mk:_gnmrk_". It is also worth pointing out that each
phase depends on the results of the previous phases (e.g. FETCH will only
fetch urls that were successfully processed by the GENERATE phase, i.e. urls
that have the gen mark set). Check that you have such marks on the entries in
your collection. If you have only the inject mark, it means that the GENERATE
phase didn't choose the url to be fetched. In this case you should check that
you pass the "curTime" parameter with the current timestamp after you did
INJECT.
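
A minimal sketch of that marker check in Python, in case it helps. It only
inspects a document you have already read from your collection. The field
name "markers", the dropped "mk:" prefix and the sample entry are my
assumptions, not something I verified against the MongoDB mapping, so adjust
them to what you actually see in your db.

```python
# Marker names as quoted above. The "mk:" prefix is dropped here on the
# assumption that the store keeps the markers in a nested map.
INJECT_MARK = "_injmrk_"
GENERATE_MARK = "_gnmrk_"


def diagnose(doc, markers_field="markers"):
    """Return a short hint about how far a single entry has progressed.

    `markers_field` is an assumed field name; check your own mapping.
    """
    markers = doc.get(markers_field) or {}
    if GENERATE_MARK in markers:
        return "generate mark set: FETCH should pick this url up"
    if INJECT_MARK in markers:
        return "inject mark only: GENERATE did not select this url, check curTime"
    return "no inject or generate mark found"


if __name__ == "__main__":
    # Hypothetical entry, shaped the way I assume a stored page looks.
    sample = {"_id": "com.example:http/", "markers": {"_injmrk_": "y"}}
    print(diagnose(sample))
```

If your entries keep the markers under a different field, pass its name as
the second argument, e.g. diagnose(doc, markers_field="mk").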

From my experience, it is better to download the Nutch sources and check in
the code what it is doing.

Hope that helps

Best Regards,
Dzmitry

On Fri, Jun 26, 2015 at 11:00 PM, Tony Colletti <[email protected]>
wrote:

> After searching your site and then having to resort to S/O, I've finally
> figured out how to create a full crawl using each command to the REST
> endpoint. However, I've noticed that after my final step is done
> (UPDATEDB), I check my db and there are many fields missing. The ones I'm
> most concerned about are the "status" and "baseUrl" fields. I'm not even sure
> if the crawl is actually being executed or not. I'm assuming it's something
> I have wrong. I've followed the examples in this<
> https://docs.google.com/document/d/1OGg22ATohapP2ycewIaTcUnENc2FeyYzni0ED_Jjxz8/edit>
> document that I found on another mailing list topic. What am I doing wrong?
> I'm using Nutch 2.3 and tying it into MongoDB as my database.
>
> Also, I've found that even after just running the command to INJECT the
> seedlist, my db already has a new collection with information in it. That
> information is the same information in the end, so it never changes. But
> when checking the status of the other commands, they all say FINISHED and
> OK. What's going on?
>
> Thanks for the help!
>
> ~ Tony
>
>
