Greetings Nutchlings,
I have been using readseg-dump successfully to retrieve content crawled by
Nutch, but I have one significant problem: many non-ASCII characters appear as
'???' in the dumped text file. This happens fairly frequently in the headlines
of news sites that I crawl, for things like quotes, apostrophes, and dashes.
Am I doing something wrong, or is this a known bug? I use a Python UTF-8
decoder, so it would be nice if everything were UTF-8.
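
To narrow down where the characters are lost, it may help to look at the raw
bytes of the dump before any decoding. If the file already contains literal
ASCII '?' bytes, the characters were replaced when the dump was written and no
decoder on the Python side can recover them. The following is only a minimal
sketch under that assumption; the function name summarize_dump_bytes is
hypothetical and is not part of Nutch.

```python
import re


def summarize_dump_bytes(raw: bytes) -> dict:
    """Summarize the raw bytes of a dump file (hypothetical helper, not a Nutch API).

    Reports whether the bytes decode as strict UTF-8, how many runs of two or
    more literal '?' bytes they contain, and how many bytes are non-ASCII.
    """
    try:
        raw.decode("utf-8")  # strict decoding: raises on any invalid sequence
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    # Runs such as '???' made of real 0x3F bytes suggest the original
    # characters were replaced before the file was written.
    question_runs = len(re.findall(rb"\?{2,}", raw))

    # Bytes above 0x7F show that at least some non-ASCII text survived.
    non_ascii_bytes = sum(1 for b in raw if b > 0x7F)

    return {
        "valid_utf8": valid_utf8,
        "question_runs": question_runs,
        "non_ascii_bytes": non_ascii_bytes,
    }
```

To use it, open the dumped text file in binary mode, pass the result of
read() to the function, and compare the counts. Many '?' runs combined with
few or no non-ASCII bytes would point at the dump step rather than at the
Python decoder.
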
Here is the command that I use to dump each segment (using Nutch 1.12):

bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate

It is so close to working perfectly!