Hi Yossi,

> I think in the case that you interrupt the fetcher, you'll have the problem
> that URLs that were scheduled to be fetched on the interrupted cycle will
> never be fetched (because of NUTCH-1842).

Yes, but only if generate.update.crawldb is true, which is not the case by
default. For the bin/crawl script, which does not run fetchers in parallel,
the default is the best option.

Btw., NUTCH-1842 is fixed in master / 1.16.
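
The cleanup Markus describes in the quoted message below (get rid of the lock
file, look for a leftover temp directory) could be sketched roughly as follows
for a crawldb on the local filesystem. This is only an illustration under
stated assumptions: the function name cleanup_crawldb is made up, the lock
file name ".locked" is what Nutch 1.x uses as far as I know, and the idea that
a leftover updatedb temp directory is a purely numeric subdirectory of the
crawldb should be verified against your version. For a crawldb on HDFS the
equivalent hadoop fs commands would be needed instead.

```shell
#!/usr/bin/env bash
# Sketch: clean up a local-filesystem crawldb after an interrupted
# generate or updatedb job. File names and layout are assumptions.

cleanup_crawldb() {
  local crawldb="$1"

  if [ ! -d "$crawldb" ]; then
    echo "no such crawldb directory: $crawldb" >&2
    return 1
  fi

  # Remove the lock file left behind by the interrupted job
  # (assumed to be named ".locked").
  if [ -e "$crawldb/.locked" ]; then
    rm -f "$crawldb/.locked"
    echo "removed lock file $crawldb/.locked"
  fi

  # Only report candidate temp directories (assumed: purely numeric
  # names). Nothing is deleted here; inspect and remove them by hand.
  local d
  for d in "$crawldb"/*; do
    [ -d "$d" ] || continue
    case "$(basename "$d")" in
      *[!0-9]*) ;;   # name contains a non-digit, e.g. "current" or "old"
      *) echo "possible leftover temp directory: $d" ;;
    esac
  done
  return 0
}
```

It would be called as cleanup_crawldb crawl/crawldb, and only when no other
Nutch job is running on that crawldb. The function deliberately only reports
temp directories instead of deleting them, since the naming is an assumption.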

Thanks,
Sebastian


On 11/19/18 2:02 PM, Yossi Tamari wrote:
> I think in the case that you interrupt the fetcher, you'll have the problem 
> that URLs that were scheduled to be fetched on the interrupted cycle will 
> never be fetched (because of NUTCH-1842).
> 
>       Yossi.
> 
>> -----Original Message-----
>> From: Markus Jelsma <markus.jel...@openindex.io>
>> Sent: 19 November 2018 14:52
>> To: user@nutch.apache.org
>> Subject: RE: RE: unexpected Nutch crawl interruption
>>
>> Hello Hany,
>>
>> That depends. If you interrupt the fetcher, the segment being fetched can be
>> thrown away. But if you interrupt updatedb, you can remove the temp directory
>> and must get rid of the lock file. The latter is also true if you interrupt
>> the generator.
>>
>> Regards,
>> Markus
>>
>>
>>
>> -----Original message-----
>>> From:hany.n...@hsbc.com <hany.n...@hsbc.com>
>>> Sent: Monday 19th November 2018 13:30
>>> To: user@nutch.apache.org
>>> Subject: RE: RE: unexpected Nutch crawl interruption
>>>
>>> So this means there is no such thing as a corrupted db at all?
>>>
>>>
>>> Kind regards,
>>> Hany Shehata
>>> Solutions Architect, Marketing and Communications IT Corporate
>>> Functions | HSBC Operations, Services and Technology (HOST) ul.
>>> Kapelanka 42A, 30-347 Kraków, Poland
>>>
>>> _________________________________________________________________
>>>
>>> Tie line: 7148 7689 4698
>>> External: +48 123 42 0698
>>> Mobile: +48 723 680 278
>>> E-mail: hany.n...@hsbc.com
>>>
>>> _________________________________________________________________
>>> Protect our environment - please only print this if you have to!
>>>
>>>
>>> -----Original Message-----
>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>> Sent: Monday, November 19, 2018 12:59 PM
>>> To: user@nutch.apache.org
>>> Subject: Re: RE: unexpected Nutch crawl interruption
>>>
>>> From the most recently updated crawldb.
>>>
>>>
>>> Sent: Monday, November 19, 2018 at 12:35 PM
>>> From: hany.n...@hsbc.com
>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>> Subject: RE: unexpected Nutch crawl interruption
>>>
>>> Hello Semyon,
>>>
>>> Does it mean that if I re-run the crawl command, it will continue from
>>> where the previous run stopped?
>>>
>>> Kind regards,
>>> Hany Shehata
>>>
>>> -----Original Message-----
>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>> Sent: Monday, November 19, 2018 12:06 PM
>>> To: user@nutch.apache.org
>>> Subject: Re: unexpected Nutch crawl interruption
>>>
>>> Hi Hany,
>>>
>>> If you open the script code, you will reach this line:
>>>
>>> # main loop : rounds of generate - fetch - parse - update
>>> for ((a=1; ; a++))
>>>
>>> with a number of break conditions.
>>>
>>> Each iteration runs n independent map jobs.
>>> If the script is interrupted, the loop stops there.
>>> You should finish the interrupted round either with manual nutch commands, or
>>> start a new run of the crawl script using the crawldb from the last iteration.
>>> Semyon.
>>>
>>>
>>>
>>> Sent: Monday, November 19, 2018 at 11:41 AM
>>> From: hany.n...@hsbc.com
>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>> Subject: unexpected Nutch crawl interruption
>>>
>>> Hello,
>>>
>>> What will happen if the bin/crawl command is forced to stop for any
>>> reason, for example a server restart?
>>>
>>> Kind regards,
>>> Hany Shehata
>>>
>>>
>>> -----------------------------------------
>>> SAVE PAPER - THINK BEFORE YOU PRINT!
>>>
>>> This E-mail is confidential.
>>>
>>> It may also be legally privileged. If you are not the addressee you may not
>>> copy, forward, disclose or use any part of it. If you have received this
>>> message in error, please delete it and all copies from your system and notify
>>> the sender immediately by return E-mail.
>>>
>>> Internet communications cannot be guaranteed to be timely secure, error or
>>> virus-free.
>>> The sender does not accept liability for any errors or omissions.
>>>
>>>
>>> ***************************************************
>>> This message originated from the Internet. Its originator may or may not be
>>> who they claim to be and the information contained in the message and any
>>> attachments may or may not be accurate.
>>> ****************************************************
>>>
> 
