There is a good chance that the dump is in a foreign language; however, this depends on which language you consider foreign and what language it actually is. AFAIK the encoding should be inferred directly from the page or document markup; failing that, there is a default fallback of windows-1252... I think.

Lewis
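
To make the "infer from markup, otherwise fall back" idea concrete, here is a rough Python sketch. It is not Nutch's actual code: the helper names `sniff_encoding` and `decode_page` are made up for illustration, and the meta-tag regex is a simplification that ignores HTTP headers and byte-order marks.

```python
import re

# Hypothetical helpers for illustration only; this is NOT Nutch's implementation.
# Simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)

# Assumed default, matching the fallback mentioned in the reply above.
DEFAULT_ENCODING = "windows-1252"


def sniff_encoding(raw: bytes) -> str:
    """Return the charset declared in the markup, or the default fallback."""
    match = META_CHARSET.search(raw[:4096])  # only look near the top of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return DEFAULT_ENCODING


def decode_page(raw: bytes) -> str:
    """Decode page bytes using the sniffed encoding, replacing bad bytes."""
    encoding = sniff_encoding(raw)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python; use the default.
        return raw.decode(DEFAULT_ENCODING, errors="replace")
```

If text in the dump looks messed up, running a sample of the original page bytes through something like `decode_page` and comparing the result with the dump may help tell a wrong-encoding problem apart from text that is simply in an unfamiliar language.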
On Monday, May 13, 2013, suzhaolong <[email protected]> wrote:
> Hello everyone!
> I have studied nutch 1.2 some days, my task is to get the body text of the
> webpage. At last i get the text file- dump, and i wonder what is the
> specific format of the dump text file ? it is utf-8 format ? because the
> text is somekind of foreign language to me ,i don't know whether they are
> messed up.
>
> thank you so much
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/NUTCH1-2-the-specific-format-of-the-dump-text-file-tp4062845.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

--
*Lewis*

