There is a good chance that the dump is in a foreign language; however, this
depends on which language you consider foreign and what language the text
actually is.
AFAIK the encoding should be inferred directly from the page or document
markup; failing that, there is a default fallback of windows-1252... I
think.
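
To check whether a dump is valid UTF-8 or just mis-decoded, something like
the rough Python sketch below may help. It is not Nutch's own detection
code. The helper names (declared_charset, decode_page, is_valid_utf8) are
made up for illustration, and the windows-1252 fallback simply mirrors the
default I assume above.

```python
import re

# Hypothetical helpers for illustration only; this is not Nutch's code.
# Matches both <meta charset="..."> and the http-equiv Content-Type form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def declared_charset(raw: bytes):
    """Return the charset named in an HTML <meta> tag, or None if absent."""
    match = META_CHARSET.search(raw[:4096])  # declarations sit near the top
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(raw: bytes, fallback: str = "windows-1252") -> str:
    """Decode with the declared charset, else the assumed fallback."""
    charset = declared_charset(raw) or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # the page declared a charset Python does not know
        return raw.decode(fallback, errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode strictly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Reading the dump file in binary mode and passing its bytes to
is_valid_utf8 is a quick first test. If it returns True and the text still
looks unfamiliar, the content is probably just in another language rather
than messed up.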
Lewis

On Monday, May 13, 2013, suzhaolong <[email protected]> wrote:
> Hello everyone!
> I have been studying Nutch 1.2 for some days; my task is to get the body
> text of a webpage. I finally got the text file dump, and I wonder what the
> specific format of the dump text file is. Is it UTF-8? The text is in some
> kind of foreign language to me, so I don't know whether it is
> messed up.
>
> thank you so much
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/NUTCH1-2-the-specific-format-of-the-dump-text-file-tp4062845.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*
