Hi James,

What exactly do you mean when you refer to processing the file? Does this mean manually observing the results, or is there some other analytics you were running on it?
The truncation may be a result of one or more properties in your nutch-site.xml configuration, for example http.content.limit. Please have a look at the size of the individual pages which have been truncated and determine whether you have accounted for them in your config values. There may also be other values in nutch-site.xml which would cause truncation.

Finally, and this is only a thought, it may be possible that certain parts of the segment became corrupted during parsing or something similar. Have you tried updating your crawldb to see whether the segment is skipped and classified as invalid?

hth

On Tue, Aug 2, 2011 at 9:56 PM, espeed <[email protected]> wrote:
> I created a dump file via...
>
> ./nutch readseg -dump crawl/segments/20110727204128 dump -nogenerate
> -noparse -noparsedata -noparsetex
>
> And when I was processing the file, I noticed that the Content section is
> truncated for several records.
>
> I tried grabbing some of the truncated pages individually via...
>
> ./nutch readseg -get ./crawl/segments/20110727204128
> http://example.com/page.html >out.txt
>
> And when I opened them up in Emacs, there is a bunch of corrupted garbage
> where the Content truncates.
>
> What would cause this? Nutch says "Status: 33 (fetch_success)" so it
> evidently thinks it downloaded it successfully.
>
> Thanks.
>
> - James
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Some-Dump-Content-Truncated-Corrupted-tp3220197p3220197.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
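As a sketch of the first suggestion: http.content.limit caps the number of bytes Nutch will fetch per page (the shipped default is 65536), and a page larger than the cap is truncated even though the fetch itself is still reported as successful, which would match the fetch_success status you're seeing. Raising the limit, or setting it to -1 to disable it entirely, in nutch-site.xml might look like this (the value here is illustrative):

```xml
<!-- nutch-site.xml sketch: raise or disable the per-page fetch cap.
     -1 disables the limit; the stock default is 65536 bytes. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum number of bytes to download per page; -1 means no limit.</description>
</property>
```

If you also fetch over other protocols, analogous properties (e.g. file.content.limit, ftp.content.limit) may need the same treatment.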

