Re: Encoding the content got from Fetcher

2009-11-27 Thread Santiago Pérez

Yes, I tried setting the Latin encoding Windows-1250 in that configuration
file, but the value of this property does not affect the encoding of the
content (I also tried a nonexistent encoding and the result is the
same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people, surely...)

Thanks
-- 
View this message in context: 
http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26536269.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Encoding the content got from Fetcher

2009-11-27 Thread Andrzej Bialecki

Santiago Pérez wrote:

Yes, I tried setting the Latin encoding Windows-1250 in that configuration
file, but the value of this property does not affect the encoding of the
content (I also tried a nonexistent encoding and the result is the
same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people, surely...)


The appearance of characters that you quoted in your other email 
indicates that the problem may be the opposite - your pages seem to use 
UTF-8, and you are trying to convert them using Windows-1250 ... Try 
putting UTF-8 in this property, and see what happens.
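
To illustrate the diagnosis above, here is a minimal sketch of how UTF-8
bytes turn into two-character garbage when decoded with a single-byte
charset. It uses only the JDK, not Nutch. The class and method names are
hypothetical, and ISO-8859-1 is used as the "wrong" charset only because
every JDK ships it; the same effect occurs with Windows-1250, with
different garbage characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then decode those bytes with the wrong charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrong);
    }

    // Undo the damage: recover the original bytes, then decode them as UTF-8.
    // This only works when the wrong charset maps every byte involved.
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset wrong = StandardCharsets.ISO_8859_1; // assumption, see above
        String original = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1\u00fc"; // the seven accented letters
        String garbled = misdecode(original, wrong);
        // Each accented letter becomes two characters, the first being U+00C3.
        System.out.println(garbled.length());                        // 14
        System.out.println(repair(garbled, wrong).equals(original)); // true
    }
}
```

If the garbage in the fetched pages follows this two-characters-per-letter
pattern, the bytes are most likely UTF-8 being read as a single-byte
encoding.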


Generally speaking, pages should declare their encoding, either in HTTP 
headers or in meta tags, but often this declaration is either missing 
or completely wrong. Nutch uses ICU4J CharsetDetector plus its own 
heuristic (in util.EncodingDetector and in HtmlParser) that tries to 
detect character encoding if it's missing or even if it's wrong - but 
this is a tricky issue and sometimes results are unpredictable.
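
The idea of "detect first, then fall back to a configured default" can
be sketched with the JDK alone. This is NOT what util.EncodingDetector
or ICU4J actually do (their heuristics are far more elaborate); it is a
much simpler stand-in that checks whether the bytes are valid UTF-8 and
otherwise uses a fallback charset. The class name and the choice of
ISO-8859-1 as fallback are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {
    // Try a strict UTF-8 decode; if the bytes are not valid UTF-8,
    // decode them with the fallback charset instead.
    static String decode(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1; // hypothetical default
        String word = "ma\u00f1ana";
        byte[] utf8Page = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1Page = word.getBytes(StandardCharsets.ISO_8859_1);
        // Both byte sequences decode back to the same word.
        System.out.println(decode(utf8Page, fallback).equals(word));
        System.out.println(decode(latin1Page, fallback).equals(word));
    }
}
```

A validity check like this cannot tell single-byte encodings apart from
each other (for example Windows-1250 versus ISO-8859-1), which is one
reason real detectors rely on statistical heuristics and can still guess
wrong.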



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Encoding the content got from Fetcher

2009-11-27 Thread Santiago Pérez

I had already tried with: 

<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

and the output of System.out.println(content.toString());
is still the HTML code with the incorrect encoding...
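
One thing that may be worth ruling out, offered as an assumption rather
than a diagnosis: System.out typically encodes text with a platform- or
console-dependent charset, so even correctly decoded text can look wrong
when printed. A minimal sketch that decodes raw bytes explicitly as UTF-8
and prints them through a UTF-8 PrintStream follows. The byte array is a
hypothetical stand-in for fetched content, and no Nutch classes are used.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class PrintDecoded {
    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical stand-in for fetched bytes: the UTF-8 encoding of U+00F1.
        byte[] raw = {(byte) 0xC3, (byte) 0xB1};
        // Wrap System.out so the output encoding is explicit as well.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(decodeUtf8(raw));
    }
}
```

Whether the result is displayed correctly still depends on the terminal
being set to UTF-8.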



Re: Encoding the content got from Fetcher

2009-11-26 Thread fadzi
Hi,

Have you tried changing this property:

parser.character.encoding.default




 Hi,

 I am a newbie to Nutch and I need some help with a problem, because I
 cannot find clear documentation.

 During the crawling process, when each FetcherThread gets the content,
 it is formatted in a way that deletes the newline characters (\n) and
 transforms useful Spanish characters such as á,é,í,ó,ú,ñ,ü in the
 default encoding into sequences like: �¡, �³, �­, �³, �º, �±, �¼.

 I would like to know if it is possible to change this default encoding
 (is it UTF-8?) to the one that I need (ASCII, I guess).

 Thanks in advance ;)