Re: Encoding the content got from Fetcher
Yes, I tried setting the Latin encoding Windows-1250 in that configuration file, but the value of this property does not affect the encoding of the content (I also tried a nonexistent encoding and the result is the same...):

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people surely have...)

Thanks

--
View this message in context: http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26536269.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Encoding the content got from Fetcher
Santiago Pérez wrote:
> Yes, I tried setting the Latin encoding Windows-1250 in that configuration file, but the value of this property does not affect the encoding of the content (I also tried a nonexistent encoding and the result is the same...):
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>Windows-1250</value>
>   <description>The character encoding to fall back to when no other information is available</description>
> </property>
>
> Has anyone had the same problem? (Hungarian or Polish people surely have...)

The appearance of the characters that you quoted in your other email indicates that the problem may be the opposite: your pages seem to use UTF-8, and you are trying to convert them using Windows-1250. Try putting UTF-8 in this property, and see what happens.

Generally speaking, pages should declare their encoding, either in HTTP headers or in meta tags, but often this declaration is either missing or completely wrong. Nutch uses the ICU4J CharsetDetector plus its own heuristics (in util.EncodingDetector and in HtmlParser) to try to detect the character encoding when the declaration is missing or even when it is wrong, but this is a tricky issue and sometimes the results are unpredictable.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
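To illustrate the order of precedence described in the reply above (use the declared encoding when there is one, otherwise fall back to a configured default), here is a minimal sketch. It is not Nutch's actual detection code: the class `CharsetResolver`, its methods, and the final hard-coded UTF-8 fallback are hypothetical names and choices made for this example, and it only looks at a `Content-Type` header value, not at meta tags or byte-level detection.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class CharsetResolver {
    // Matches e.g. "charset=UTF-8" or "charset='windows-1250'" in a header value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Prefer the charset declared in the Content-Type header; otherwise use
    // the configured fallback; if that is unusable too, use UTF-8 (assumption).
    public static Charset resolve(String contentType, String fallbackName) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                Charset declared = lookup(m.group(1));
                if (declared != null) {
                    return declared;
                }
            }
        }
        Charset fallback = lookup(fallbackName);
        return fallback != null ? fallback : StandardCharsets.UTF_8;
    }

    // Returns null instead of throwing when the name is null, illegal or unsupported.
    static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=UTF-8", "windows-1250")); // declared charset is used
        System.out.println(resolve("text/html", "ISO-8859-1"));                  // fallback is used
    }
}
```

In a scheme like this, the fallback value only matters for pages that declare no usable encoding, so changing it has no visible effect on pages whose declared (or detected) encoding takes precedence.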
Re: Encoding the content got from Fetcher
I had already tried with:

<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

and System.out.println(content.toString()); still prints the HTML code with the incorrect encoding...
Re: Encoding the content got from Fetcher
Hi,

Have you tried changing this property: parser.character.encoding.default?

> Hi,
>
> I am a newbie in Nutch and I need some help with a problem, because I cannot find clear documentation. In the crawling process, when each FetcherThread gets the content, it is formatted in a way that deletes the newline characters (\n) and transforms useful Spanish characters such as á, é, í, ó, ú, ñ, ü into sequences like: Ã?á, Ã?ó, Ã?ÃÂ, Ã?ó, Ã?ú, Ã?ñ, Ã?ü.
>
> I would like to know if it is possible to change this default encoding (is it UTF-8?) to the one that I need (ASCII, I guess).
>
> Thanks in advance ;)
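The kind of garbling reported in the original post typically appears when UTF-8 bytes are decoded with a single-byte charset. The sketch below reproduces the effect and reverses it. It is only an illustration: the charsets actually involved in the poster's crawl are not known from this thread, ISO-8859-1 is used here as a stand-in for the wrong charset, and the class `MojibakeDemo` with its methods is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
public class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    // Each accented Spanish letter takes two bytes in UTF-8, so it comes back
    // as two unrelated characters.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    // This works because ISO-8859-1 maps every byte value to exactly one character.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e1\u00f1"; // the two letters a-acute and n-tilde
        String garbled = misdecode(original);
        System.out.println(garbled.length());                 // prints 4
        System.out.println(repair(garbled).equals(original)); // prints true
    }
}
```

If text has been through this wrong decoding more than once, each original letter expands further, which may explain output with more than two characters per letter.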