I created a dump file via:

  ./nutch readseg -dump crawl/segments/20110727204128 dump -nogenerate -noparse -noparsedata -noparsetext

When I was processing the file, I noticed that the Content section is truncated for several records. I tried grabbing some of the truncated pages individually via:

  ./nutch readseg -get ./crawl/segments/20110727204128 http://example.com/page.html >out.txt

When I opened them in Emacs, there was a bunch of corrupted garbage at the point where the Content truncates.

What would cause this? Nutch says "Status: 33 (fetch_success)", so it evidently thinks it downloaded the page successfully.

Thanks.
- James

--
View this message in context: http://lucene.472066.n3.nabble.com/Some-Dump-Content-Truncated-Corrupted-tp3220197p3220197.html
Sent from the Nutch - User mailing list archive at Nabble.com.

