ahmed baseet wrote:

public void postToSolrUsingSolrj(String rawText, String pageId) {

            doc.addField("features", rawText );

In the above, the param rawText is just the HTML stripped of all
its tags, JS, CSS etc., and pageId is the URL for that page. When I'm
using this for English pages it's working perfectly fine, but the
problem comes up when I'm trying to index some non-English pages.

Maybe you're constructing a string without specifying the encoding, so
Java uses your default platform encoding?

String(byte[] bytes)
  Constructs a new String by decoding the specified array of
  bytes using the platform's default charset.

String(byte[] bytes, Charset charset)
  Constructs a new String by decoding the specified array of bytes using
  the specified charset.
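
A minimal sketch of the second constructor, assuming the page arrives
as UTF-8 bytes. The class name, the decode() helper and the sample
bytes are made up for illustration and are not from your code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitDecode {
  // Decode raw bytes with a caller-supplied charset instead of
  // relying on the platform default.
  static String decode(byte[] raw, Charset charset) {
    return new String(raw, charset);
  }

  public static void main(String[] args) {
    // Hypothetical sample input: the UTF-8 bytes of the two Tamil
    // letters U+0BA4 and U+0BAE (three bytes each).
    byte[] raw = {
      (byte) 0xE0, (byte) 0xAE, (byte) 0xA4,
      (byte) 0xE0, (byte) 0xAE, (byte) 0xAE
    };
    String text = decode(raw, StandardCharsets.UTF_8);
    // Six bytes decode to two characters.
    System.out.println(text.length());
    System.out.println(text.equals("\u0BA4\u0BAE"));
  }
}
```

This should print 2 and true regardless of the platform default,
because the charset is named explicitly.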

What I did next was extract the raw text from that HTML page and
manually create an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
  </doc>
</add>

and posted this from the command line using the post.jar file. Now
searching gives me the result, and unlike last time the browser shows the
indexed text in Tamil itself and not as raw Unicode.

Now that's perfect, isn't it?

I tried doing something like this also,

// Encode in Unicode UTF-8
 utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either.

No encoding is specified when the String is constructed from the bytes,
so they are decoded using the default platform encoding, which is likely
not what you want. Consider the following example:

package milu;
import java.nio.charset.Charset;
public class StringAndCharset {
  public static void main(String[] args) {
    byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' };
    System.out.println(Charset.defaultCharset().displayName());
    System.out.println(new String(bytes));
    System.out.println(new String(bytes,  Charset.forName("UTF-8")));
  }
}

Output:

windows-1252
KÃ¤se (bad)
Käse (good)
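
The same reasoning applies to your getBytes("UTF-8") attempt. Here is a
rough sketch of that round trip. The class name and the reencode()
helper are made up for illustration, and ISO-8859-1 only stands in for
whatever your platform default happens to be:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
  // Encode with one charset and decode with another, to mimic
  // new String(text.getBytes("UTF-8")) on a non-UTF-8 platform.
  static String reencode(String text, Charset encodeWith, Charset decodeWith) {
    return new String(text.getBytes(encodeWith), decodeWith);
  }

  public static void main(String[] args) {
    String original = "K\u00e4se";

    // Same charset on both sides: the string comes back unchanged,
    // so the round trip gains nothing.
    String same = reencode(original,
        StandardCharsets.UTF_8, StandardCharsets.UTF_8);
    System.out.println(same.equals(original));

    // Mismatched charsets: the two UTF-8 bytes of U+00E4 are decoded
    // as two separate characters, so the string grows from 4 to 5.
    String mangled = reencode(original,
        StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
    System.out.println(mangled.length());
    System.out.println(mangled.equals("K\u00c3\u00a4se"));
  }
}
```

This should print true, 5 and true. A Java String is already Unicode
internally, so re-encoding it this way can only leave it as it is or
damage it. The charset has to be right at the point where the bytes
are first turned into a String.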

Michael Ludwig
