Hi,

I use hbase-0.92.1 and do not have any problems with UTF-8 chars. What exactly
is your problem?

Alex.


-----Original Message-----
From: Ake Tangkananond <[email protected]>
To: user <[email protected]>
Sent: Thu, Aug 9, 2012 11:12 am
Subject: Re: Nutch 2 encoding


Hi,

I'm debugging.

I inserted code to print out the encoding in the getParse function of
HtmlParser.java, and it printed utf-8. So I think it might be a data
store problem. What else could be the cause? Could you advise what I
should try next to get my Thai chars stored correctly in HBase? Can I
simply go with the latest version of HBase? (I'm not sure whether it is
compatible with Nutch 2.0.)


      byte[] contentInOctets = page.getContent().array();
      InputSource input = new InputSource(
          new ByteArrayInputStream(contentInOctets));

      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(page, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(page, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      LOG.info("encoding : " + encoding);
      input.setEncoding(encoding);
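
To tell a parser problem apart from a storage problem, one option is to
compare the raw bytes before they are written and after they are read back.
Below is a minimal sketch that uses only the JDK. The class EncodingCheck and
its methods are hypothetical helpers, not part of Nutch or HBase, and the
sample word and its TIS-620 bytes are just an illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch or HBase.
class EncodingCheck {

    // Returns true only if the bytes decode as strict UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lower-case hex dump, handy for comparing bytes before and after storage.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Thai word for "Thai", written with escapes so the source file
        // encoding cannot interfere with the test itself.
        String thai = "\u0e44\u0e17\u0e22";
        byte[] utf8 = thai.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8) + " valid=" + isValidUtf8(utf8));

        // The same word as TIS-620 bytes (assumed sample data): not valid UTF-8.
        byte[] tis620 = {(byte) 0xE4, (byte) 0xB7, (byte) 0xC2};
        System.out.println(toHex(tis620) + " valid=" + isValidUtf8(tis620));
    }
}
```

If the hex dump logged in the parser matches the one for the value read back
from the store, the store is probably not at fault, and the place where the
text is displayed or converted to a String would be the next thing to check.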



Regards,
Ake Tangkananond



On 8/9/12 11:06 PM, "Ake Tangkananond" <[email protected]> wrote:

>Hi,
>
>Sorry for the late reply. I was trying to figure it out myself but had no
>luck.
>
>I'm on HBase 0.90.6 (r1295128) in a local deployment, the version listed
>as working in the wiki:
>http://wiki.apache.org/nutch/Nutch2Tutorial
>
>
>Regards,
>Ake Tangkananond
>
>
>
>
>On 8/9/12 10:30 PM, "Ferdy Galema" <[email protected]> wrote:
>
>>It depends on the datastore and possibly the server? What store are you
>>using?
>>
>>On Thu, Aug 9, 2012 at 4:05 PM, Ake Tangkananond <[email protected]>
>>wrote:
>>
>>> Hi all,
>>>
>>> I just wonder if Nutch 2 is working fine with non-English characters in
>>> your deployment? Thai used to work fine for me in Nutch 1.5 but not in
>>> Nutch 2. Did I miss something? Is there anything I should check?
>>>
>>> Sorry for the silly questions, but thank you in advance. ;-)
>>>
>>>
>>> Regards,
>>> Ake Tangkananond
>>>
>>>
>>>
>
>



 
