readseg dump and non-ASCII characters

Michael Coffey Tue, 14 Nov 2017 17:23:07 -0800

Greetings Nutchlings,
I have been using readseg -dump successfully to retrieve content crawled by 
Nutch, but I have one significant problem: many non-ASCII characters appear as 
'???' in the dumped text file. This happens fairly frequently in the headlines 
of news sites that I crawl, for things like quotes, apostrophes, and dashes.
Am I doing something wrong, or is this a known bug? I read the dump with a 
Python UTF-8 decoder, so it would be nice if everything were UTF-8.
Here is the command that I use to dump each segment (using Nutch 1.12):

bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate

It is so close to working perfectly!
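
One way to narrow down where the characters get lost is to look at the dump as raw bytes. The sketch below is only an illustration: inspect_dump is a hypothetical helper, not part of Nutch, and the path in the usage comment assumes that readseg wrote a file named "dump" under destPath. If the file contains literal '???' (plain 0x3F bytes), the characters were already replaced before Python read the file, so no decoder setting on the Python side would bring them back. If it contains invalid UTF-8 byte sequences instead, the file was probably written in some other encoding.

```python
def inspect_dump(raw: bytes) -> dict:
    """Summarise how non-ASCII text survived in a dumped file.

    Hypothetical helper for diagnosis only; it is not a Nutch API.
    """
    # Decode leniently: every invalid byte sequence becomes U+FFFD
    # instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    return {
        # Characters that decoded cleanly as non-ASCII UTF-8.
        "non_ascii_chars": sum(
            1 for ch in text if ord(ch) > 127 and ch != "\ufffd"
        ),
        # Replacement characters: the bytes were not valid UTF-8.
        "invalid_utf8": text.count("\ufffd"),
        # Literal runs of three question marks already present in the file.
        "question_mark_runs": text.count("???"),
    }


# Usage (the path is an assumption about where the dump ends up):
# with open("destPath/dump", "rb") as f:
#     print(inspect_dump(f.read()))
```

A high question_mark_runs count together with zero invalid_utf8 would suggest the substitution happens when the dump is written rather than when it is read.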
