Re: skipping invalid segments

Cam Bazz Fri, 08 Jul 2011 14:08:06 -0700

Hello,

It appears that I omitted -dir when writing my previous message, although
I had actually typed -dir in my console.


However, I have found out that I need to run nutch parse on
/home/crawl/segments/12345 before updating the db.
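
For reference, the whole manual cycle with the parse step included can be
sketched as a small shell script. This is only a sketch under assumptions:
the variable names (NUTCH, CRAWLDB, SEGMENTS, URLS) and the helper functions
latest_segment and crawl_once are illustrative and not part of Nutch, the
paths are the ones used in this thread, and it assumes segment directories
are named by timestamp so that the last one in sort order is the newest.

```shell
#!/bin/sh
# Sketch of one manual crawl cycle: inject, generate, fetch, parse, updatedb.
# NUTCH, CRAWLDB, SEGMENTS and URLS are illustrative variables, not Nutch settings.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-/home/crawl/crawldb}"
SEGMENTS="${SEGMENTS:-/home/crawl/segments}"
URLS="${URLS:-/home/urls}"

# Print the newest segment directory under $1.
# Assumption: segment names are timestamps of equal length, so the
# lexicographically last directory is the most recent one.
latest_segment() {
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

# Run one cycle; stop at the first step that fails.
crawl_once() {
    "$NUTCH" inject "$CRAWLDB" "$URLS" || return 1
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" || return 1
    segment=$(latest_segment "$SEGMENTS")
    [ -n "$segment" ] || { echo "no segment found in $SEGMENTS" >&2; return 1; }
    "$NUTCH" fetch "$segment" || return 1
    # Parse before updatedb; without this step the segment was skipped as invalid.
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb "$CRAWLDB" "$segment" -noAdditions
}
```

The script only defines the functions; calling crawl_once from the Nutch
home directory runs the cycle. Setting NUTCH=echo prints the commands
instead of running them, which is a cheap way to check the paths first.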

By the way: what exactly is a segment, and how is data stored under
it? I think it is a Hadoop format.

Best Regards,
-C.B.

On Fri, Jul 8, 2011 at 11:00 PM, lewis john mcgibbney
<[email protected]> wrote:
> Hi C.B.,
>
> It looks like you may have simply missed the '-dir' when you were specifying
> your crawldb directory to be updated from the fetched segment. Have a look
> here [1]
>
> Can you please try that and post your results?
>
> [1] http://wiki.apache.org/nutch/bin/nutch_updatedb
>
>
>
> On Fri, Jul 8, 2011 at 5:06 PM, Cam Bazz <[email protected]> wrote:
>
>> Hello,
>>
>> I tried to crawl manually, only a list of urls. I have issued the
>> following commands:
>>
>> bin/nutch inject /home/crawl/crawldb /home/urls
>>
>> bin/nutch generate /home/crawl/crawldb /home/crawl/segments
>>
>> bin/nutch fetch /home/crawl/segments/123456789
>>
>> bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/123456789
>> -noAdditions
>>
>> However, for the last command, it skips the segment 123456789, saying
>> it is an invalid segment.
>>
>> The -noAdditions flag is exactly what I need, but updatedb will not
>> process the segment. What might I have done wrong?
>>
>> Best Regards,
>> -C.B.
>>
>
>
>
> --
> *Lewis*
>
