Problem adding unicoded docs to Solr through SolrJ

ahmed baseet Wed, 29 Apr 2009 05:02:07 -0700

Hi All,
I'm trying to automate posting XML documents to Solr using SolrJ.
Essentially, I extract the text from a given URL, create a
SolrInputDocument from it, and post it using the following function:


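import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

public void postToSolrUsingSolrj(String rawText, String pageId) {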
        String url = "http://localhost:8983/solr";
        CommonsHttpSolrServer server;

        try {
            // Get connection to Solr server
            server = new CommonsHttpSolrServer(url);

            // Set XMLResponseParser : Reqd for older version of Solr 1.3
            server.setParser(new XMLResponseParser());

            server.setSoTimeout(1000);  // socket read timeout
            server.setConnectionTimeout(100);
            server.setDefaultMaxConnectionsPerHost(100);
            server.setMaxTotalConnections(100);
            server.setFollowRedirects(false);  // defaults to false
            // allowCompression defaults to false.
            // Server side must support gzip or deflate for this to have any effect.
            server.setAllowCompression(true);
            server.setMaxRetries(1); // defaults to 0.  > 1 not recommended.

            // WARNING : this will delete all pre-existing Solr index
            //server.deleteByQuery( "*:*" );// delete everything!

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pageId );
            doc.addField("features", rawText );


            // Add the docs to Solr Server
            server.add(doc);

            // Commit the changes
            server.commit();

        } catch (Exception e) {
            e.printStackTrace(); // don't silently swallow indexing errors
        }
    }

In the above, the rawText param is just the HTML stripped of all its
tags, JS, CSS etc., and pageId is the URL of that page. For English pages
this works perfectly fine, but the problem comes up when I try to index
non-English pages. For pages in Tamil, say, the Unicode/UTF-8 encoding
seems to cause a problem: after indexing such pages, when I search for
them from the Solr admin search interface, I get results, but the content
is not shown in the original language (Tamil). Instead it displays just
some characters, I think raw Unicode. The same thing works fine for pages
in English.
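
One way to narrow this down is to check whether rawText already holds the
right characters before it reaches Solr. The following is only a sketch of
such a check: the class name CodePointDump and its two helper methods are
hypothetical names invented for illustration, not part of SolrJ. It prints
the string as U+XXXX code points, so the result does not depend on how the
console or browser renders the text.

```java
public class CodePointDump {

    // Render every code point as U+XXXX so the real content of the String
    // is visible regardless of console or browser encoding.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // True if at least one code point lies in the Unicode Tamil block
    // (U+0B80..U+0BFF).
    static boolean containsTamil(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TAMIL) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample value standing in for rawText: the word "Tamil" in Tamil script.
        String sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd";
        System.out.println(toCodePoints(sample));   // U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD
        System.out.println(containsTamil(sample));  // true
    }
}
```

If the dump of rawText for a Tamil page shows no code points in the Tamil
block, the text was probably decoded with the wrong charset while the page
was being fetched, i.e. before SolrJ is involved at all.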

What I did next was extract the raw text from that HTML page and
manually create an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
  </doc>
</add>

and posted it from the command line using post.jar. Now searching gives
me the result, but unlike last time the browser shows the indexed text in
Tamil itself and not as raw Unicode. So this seems to show that the string
I'm using to create the SolrInputDocument has some encoding issue, right?
Or is it something else? I also tried something like this:

// Encode in Unicode UTF-8
 utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either.
It seems to be some silly problem somewhere that I'm not able to catch. :-)
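
For reference, a Java String is already a sequence of UTF-16 code units,
so encoding it to UTF-8 bytes and decoding those bytes again cannot repair
text that was decoded wrongly earlier. The sketch below (the class name
RoundTripDemo and the roundTrip helper are hypothetical, and it assumes
Java 7+ for StandardCharsets) mirrors new String(rawText.getBytes("UTF-8")),
with the decoding charset made explicit instead of relying on the platform
default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Encode the text as UTF-8, then decode the bytes with the given charset.
    // new String(bytes) without a charset does the same thing using the
    // platform default charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"; // 5 Tamil code points

        // Decoding with UTF-8 gives back the identical string: a no-op.
        String same = roundTrip(tamil, StandardCharsets.UTF_8);
        System.out.println(same.equals(tamil));   // true

        // Decoding with a single-byte charset turns each of the 15 UTF-8
        // bytes into a separate character: the text is now corrupted.
        String broken = roundTrip(tamil, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals(tamil)); // false
        System.out.println(broken.length());      // 15
    }
}
```

So the round trip either changes nothing (default charset is UTF-8) or
damages the text (any other default), which is consistent with it not
helping here.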

I'd appreciate it if someone could point out the bug...

Thanks,
Ahmed.
