Yeah, I just figured out that if I set

export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"

everything works.  The OutputStreamWriter used by StreamingUpdateSolrServer 
uses the platform default encoding.  Hard-coding UTF-8 might be better, but 
maybe there are reasons not to.
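
To illustrate the point about the default encoding, here is a minimal sketch (not the SolrJ source). The class WriterEncodingSketch and its encode helper are hypothetical names made up for the example, and US-ASCII only stands in for a non-UTF-8 platform default; the real default depends on the JVM and locale.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical demo class; not part of SolrJ.
class WriterEncodingSketch {

    // Pushes text through an OutputStreamWriter bound to an explicit charset
    // and returns the bytes that would go over the wire.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            Writer writer = new OutputStreamWriter(out, charset);
            writer.write(text);
            writer.close(); // flushes the encoder into the byte array
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String greek = "\u03A4\u03B7\u03C2"; // three Greek letters
        Charset utf8 = Charset.forName("UTF-8");
        Charset ascii = Charset.forName("US-ASCII"); // stand-in for a bad default

        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("UTF-8 byte count: " + encode(greek, utf8).length);
        System.out.println("US-ASCII result:  " + new String(encode(greek, ascii), ascii));
    }
}
```

With US-ASCII the Greek comes back as question marks, which resembles the garbled result in the original report; with UTF-8 each of the three letters is written as two bytes.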

Thanks,
Hugh

On May 28, 2010, at 3:02 PM, Tim Gilbert wrote:

> I had a similar problem a few days ago and I found that the documents were 
> not being loaded correctly as UTF-8 into Solr.  In my case, the loader 
> program was a Java .jar I was executing from a cron job.  There I added this:
> 
> java -Dfile.encoding=UTF-8 -jar /home/tim/solr/bin/loadSiteSearch.jar
> 
> Then, within that program, I wrote a function to take the strings I was 
> loading and expressly declare them as UTF-8 like this:
> 
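> // Note: getBytes() encodes with the platform default charset, so the
> // result depends on -Dfile.encoding.
> private String toUTF8(String value) throws java.io.UnsupportedEncodingException
> {
>       return new String(value.getBytes(), "UTF-8");
> }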
> 
> and that solved the problem for me.
> 
> Tim
> 
> -----Original Message-----
> From: Hugh Cayless [mailto:philomou...@gmail.com] 
> Sent: Friday, May 28, 2010 12:51 PM
> To: solr-user@lucene.apache.org
> Subject: SolrJ Unicode problem
> 
> Hi, I'm a Solr newbie, and I'm hoping someone can point me in the right 
> direction.
> 
> I'm trying to index a bunch of documents with Greek text in them.  I can 
> successfully index documents by generating add XML and using curl to send 
> them to my server, but when I use SolrJ to create and send documents, the 
> encoding gets thoroughly messed up.
> 
> 
> Instead of the result (from an add doc posted with curl):
> 
> <result name="response" numFound="1" start="0">
>  <doc>
>    <str name="id">c.etiq.mom;;2077</str>
>    <str name="transcription">Της Βησο ς Χρη εις Πανοπολίτης</str>
>  </doc>
> </result>
> 
> I get (from a SolrInputDocument loaded with SolrJ):
> 
> <result name="response" numFound="1" start="0"> 
> <doc> 
>  <str name="id">c.etiq.mom;;2077</str> 
>  <str name="transcription">??? ???? ? ??? ??? ????�??????</str> 
> </doc> 
> </result>
> 
> I can confirm that the SolrInputDocument's transcription field contains Greek 
> text before I call .add(documents) on the StreamingUpdateSolrServer (i.e., I 
> can get Greek back out of it).  So I don't know what to do next.  Any ideas?
> 
> Thanks,
> Hugh
