About the second question:
Assume I run Nutch on a huge number of files and one file's parsing fails, let's say because the file was corrupted. I fix the file and now I want to re-crawl it.
If I just run the crawl again, the file won't be fetched because of the fetch interval.
I can try running Nutch on this one file separately and then merging, but it doesn't work as it should: something goes wrong during the merge and I end up with a smaller index than I had before.
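For reference, the merge I attempt looks roughly like this (just a sketch with made-up paths, assuming the Nutch 1.x mergedb/mergesegs tools):

  # crawl the fixed file in a separate crawl dir
  bin/nutch inject single/crawldb seed/
  bin/nutch generate single/crawldb single/segments
  s=$(ls -d single/segments/* | tail -1)   # the newly generated segment
  bin/nutch fetch "$s"
  bin/nutch parse "$s"
  bin/nutch updatedb single/crawldb "$s"
  # merge it back into the main crawl
  bin/nutch mergedb merged_crawldb crawl/crawldb single/crawldb
  bin/nutch mergesegs merged_segments crawl/segments/* single/segments/*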
So the question is: how can I re-crawl a single URL while running a Nutch crawl on an existing input?
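For what it's worth, one approach I've been wondering about is the freegen tool (FreeGenerator), which seems to build a fetch list straight from a plain text file of URLs and thus sidesteps the fetch interval. A rough sketch, with made-up paths:

  # refetch_urls/ holds a text file listing just the fixed URL
  bin/nutch freegen refetch_urls/ crawl/segments
  s=$(ls -d crawl/segments/* | tail -1)   # the segment freegen just created
  bin/nutch fetch "$s"
  bin/nutch parse "$s"
  bin/nutch updatedb crawl/crawldb "$s"

Is something like this the right approach, or does it run into the same merge problem?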
Any other insights on these issues would be appreciated.
Markus Jelsma-2 wrote:
hi,
On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@"
<nutch.buddy@> wrote:
Hi
There are some scenarios of failure in Nutch which I'm not sure how to handle.
1. I run Nutch on a huge amount of URLs and some kind of OOM exception is thrown, or one of those "cannot allocate memory" errors. The result is that my segment is half complete.
A segment cannot be half complete; it either fails or succeeds entirely.
How can I recover from this? Do I have to re-crawl all the URLs that were in the segment? If so, how do I mark them for re-crawl in the crawldb?
You can just generate new segments, or remove all directories from the segment except crawl_generate and start fetching again.
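For example, as a sketch (the segment name is just an example):

  SEG=crawl/segments/20120410123456   # the segment whose fetch failed
  # keep only the generated fetch list, drop the partial output
  for d in crawl_fetch crawl_parse content parse_data parse_text; do
    rm -rf "$SEG/$d"
  done
  bin/nutch fetch "$SEG"
  bin/nutch parse "$SEG"
  bin/nutch updatedb crawl/crawldb "$SEG"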
2. I run Nutch on a huge amount of URLs and some URLs are not parsed successfully. I get an index which has all the URLs that worked and doesn't have the ones that didn't work. How can I handle them without having to re-crawl the whole thing?
I don't understand.
Thanks.
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3899044.html
Sent from the Nutch - User mailing list archive at Nabble.com.