Hi James,

What exactly do you mean when you refer to processing the file? Does this
mean manually inspecting the results, or are you running some other
analytics on it?

The truncation may be caused by one or more properties in your
nutch-site.xml configuration, for example http.content.limit. Please check
the size of the individual pages which have been truncated and determine
whether your config values account for them. There may also be other values
in nutch-site.xml which could cause truncation. Finally, and this is only a
thought, it is possible that parts of the segment became corrupted during
parsing or something similar. Have you tried updating your crawldb to see
whether the segment is skipped and classified as invalid?
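If http.content.limit is the culprit, the relevant nutch-site.xml entry
might look something like the sketch below. The value shown is only an
illustration; pick one larger than your biggest truncated page, or -1 to
disable the limit entirely (the default is 65536 bytes).

```xml
<!-- nutch-site.xml (sketch): raise the per-page download limit.
     The value below is illustrative only. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <description>Maximum number of bytes to download per page;
  content beyond this limit is truncated. -1 means no limit.</description>
</property>
```

Note that changing this only affects future fetches; pages already stored
in the segment will remain truncated until re-crawled.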

hth

On Tue, Aug 2, 2011 at 9:56 PM, espeed <[email protected]> wrote:

> I created a dump file via...
>
> ./nutch readseg -dump crawl/segments/20110727204128 dump -nogenerate
> -noparse -noparsedata -noparsetex
>
> And when I was processing the file, I noticed that the Content section is
> truncated for several records.
>
> I tried grabbing some of the truncated pages individually via...
>
> ./nutch readseg -get ./crawl/segments/20110727204128
> http://example.com/page.html >out.txt
>
> And when I opened them up in Emacs, there is a bunch of corrupted garbage
> where the Content truncates.
>
> What would cause this? Nutch says "Status: 33 (fetch_success)" so evidently
> thinks it downloaded it successfully.
>
> Thanks.
>
> - James
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Some-Dump-Content-Truncated-Corrupted-tp3220197p3220197.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*
