Hello Yossi,

That should only be the case if the generator is configured to update the
CrawlDB (generate.update.crawldb), which is not the default.

Regards,
Markus
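
Editorial sketch of the cleanup described further down the thread (remove the
temp directory and the lock file after an interrupted updatedb or generate).
It is only an illustration under assumptions that are not stated in the
thread: the crawldb lives on a local filesystem, the lock file is named
".locked", and a leftover updatedb temp directory is a numerically named
subdirectory of the crawldb. The function name cleanup_crawldb is
hypothetical. Verify the names in your own crawl directory, and run it only
when no Nutch job is active.

```shell
#!/usr/bin/env bash
# Sketch: clean a crawldb directory after an interrupted updatedb/generate.
# ASSUMPTIONS (check against your installation):
#   - local filesystem (on HDFS the equivalent "hadoop fs" commands are needed)
#   - lock file is "<crawldb>/.locked"
#   - leftover temp directories are all-digit names directly under <crawldb>

cleanup_crawldb() {
  local crawldb="$1"
  local d name

  # Remove the stale lock file, if there is one.
  rm -f "$crawldb/.locked"

  # Remove leftover all-digit temp directories; anything else is kept.
  for d in "$crawldb"/*/; do
    [ -d "$d" ] || continue          # glob matched nothing
    name=$(basename "$d")
    case "$name" in
      *[!0-9]*) ;;                   # contains a non-digit (current, old, ...): keep
      *) rm -rf "$d" ;;              # digits only: assumed temp directory
    esac
  done
}
```

Usage would be something like `cleanup_crawldb crawl/crawldb`, where
crawl/crawldb is a placeholder path. A segment left behind by an interrupted
fetcher is not touched here; per the thread it can simply be thrown away.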

 
 
-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Monday 19th November 2018 14:04
> To: user@nutch.apache.org
> Subject: RE: RE: unexpected Nutch crawl interruption
> 
> I think that if you interrupt the fetcher, you'll have the problem that
> URLs that were scheduled to be fetched in the interrupted cycle will
> never be fetched (because of NUTCH-1842).
> 
>       Yossi.
> 
> > -----Original Message-----
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: 19 November 2018 14:52
> > To: user@nutch.apache.org
> > Subject: RE: RE: unexpected Nutch crawl interruption
> > 
> > Hello Hany,
> > 
> > That depends. If you interrupt the fetcher, the segment being fetched can
> > be thrown away. But if you interrupt updatedb, you can remove the temp
> > directory and must get rid of the lock file. The latter is also true if
> > you interrupt the generator.
> > 
> > Regards,
> > Markus
> > 
> > 
> > 
> > -----Original message-----
> > > From:hany.n...@hsbc.com <hany.n...@hsbc.com>
> > > Sent: Monday 19th November 2018 13:30
> > > To: user@nutch.apache.org
> > > Subject: RE: RE: unexpected Nutch crawl interruption
> > >
> > > Does this mean there is no such thing as a corrupted db at all?
> > >
> > >
> > > Kind regards,
> > > Hany Shehata
> > > Solutions Architect, Marketing and Communications IT Corporate
> > > Functions | HSBC Operations, Services and Technology (HOST) ul.
> > > Kapelanka 42A, 30-347 Kraków, Poland
> > >
> > _________________________________________________________________
> > _
> > >
> > > Tie line: 7148 7689 4698
> > > External: +48 123 42 0698
> > > Mobile: +48 723 680 278
> > > E-mail: hany.n...@hsbc.com
> > >
> > _________________________________________________________________
> > _
> > > Protect our environment - please only print this if you have to!
> > >
> > >
> > > -----Original Message-----
> > > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > > Sent: Monday, November 19, 2018 12:59 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: RE: unexpected Nutch crawl interruption
> > >
> > > From the most recently updated crawldb.
> > >
> > >
> > > Sent: Monday, November 19, 2018 at 12:35 PM
> > > From: hany.n...@hsbc.com
> > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > Subject: RE: unexpected Nutch crawl interruption
> > >
> > > Hello Semyon,
> > >
> > > Does this mean that if I re-run the crawl command, it will continue from
> > > where the previous run stopped?
> > >
> > > Kind regards,
> > > Hany Shehata
> > >
> > > -----Original Message-----
> > > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > > Sent: Monday, November 19, 2018 12:06 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: unexpected Nutch crawl interruption
> > >
> > > Hi Hany,
> > >
> > > If you open the script code, you will find this line:
> > >
> > > # main loop : rounds of generate - fetch - parse - update
> > > for ((a=1; ; a++))
> > >
> > > with a number of break conditions.
> > >
> > > Each iteration calls n independent map jobs.
> > > If it breaks, it stops.
> > > You should either finish the loop with manual nutch commands, or start a
> > > new call of the crawl script using the crawldb of the past iteration.
> > > Semyon.
> > >
> > >
> > >
> > > Sent: Monday, November 19, 2018 at 11:41 AM
> > > From: hany.n...@hsbc.com
> > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > Subject: unexpected Nutch crawl interruption
> > >
> > > Hello,
> > >
> > > What will happen if the bin/crawl command is forced to stop for any
> > > reason, for example a server restart?
> > >
> > > Kind regards,
> > > Hany Shehata
> > >
> > >
> > > -----------------------------------------
> > > SAVE PAPER - THINK BEFORE YOU PRINT!
> > >
> > > This E-mail is confidential.
> > >
> > > It may also be legally privileged. If you are not the addressee you may
> > > not copy, forward, disclose or use any part of it. If you have received
> > > this message in error, please delete it and all copies from your system
> > > and notify the sender immediately by return E-mail.
> > >
> > > Internet communications cannot be guaranteed to be timely secure, error or
> > > virus-free.
> > > The sender does not accept liability for any errors or omissions.
> > >
> > >
> > > ***************************************************
> > > This message originated from the Internet. Its originator may or may not
> > > be who they claim to be and the information contained in the message and
> > > any attachments may or may not be accurate.
> > > ****************************************************
> > >
> 
> 
