I created a dump file via...

./nutch readseg -dump crawl/segments/20110727204128 dump -nogenerate -noparse -noparsedata -noparsetext

While processing the file, I noticed that the Content section is
truncated for several records.

I tried grabbing some of the truncated pages individually via...

./nutch readseg -get ./crawl/segments/20110727204128 http://example.com/page.html >out.txt

When I opened them in Emacs, there was a bunch of corrupted garbage
at the point where the Content truncates.
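
To pin down where the garbage starts, a small script along these lines
might help. It is only a sketch: it assumes the file is the out.txt written
by the command above, and it treats any control byte other than tab, LF, or
CR as the start of binary data. That rule is my own heuristic, not something
Nutch defines, and the names first_control_byte and report are made up for
illustration.

```python
import sys

# Control bytes that normally appear in plain text: tab, line feed, carriage return.
TEXT_CONTROLS = {0x09, 0x0A, 0x0D}


def first_control_byte(data: bytes):
    """Return the offset of the first unexpected control byte, or None.

    Heuristic (an assumption, not a Nutch rule): HTML text should not contain
    control bytes below 0x20 other than tab/LF/CR, nor the DEL byte 0x7F.
    """
    for offset, byte in enumerate(data):
        if (byte < 0x20 and byte not in TEXT_CONTROLS) or byte == 0x7F:
            return offset
    return None


def report(path: str) -> None:
    """Print where the first suspicious byte is, with some surrounding bytes."""
    with open(path, "rb") as f:
        data = f.read()
    offset = first_control_byte(data)
    if offset is None:
        print("no unexpected control bytes found")
        return
    print(f"first unexpected control byte at offset {offset} of {len(data)}")
    # Show raw bytes on both sides of the offset so the boundary is visible.
    print(data[max(0, offset - 40):offset + 40])


if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

Saved under any file name and run with out.txt as its only argument, it
prints the offset and the raw bytes on either side, which should show whether
the readable text stops at a consistent position across pages.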

What would cause this? Nutch says "Status: 33 (fetch_success)", so evidently
it thinks it downloaded the page successfully.

Thanks.

- James 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Some-Dump-Content-Truncated-Corrupted-tp3220197p3220197.html
Sent from the Nutch - User mailing list archive at Nabble.com.
