Yeah, I just figured out that if I set

    export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"

everything works. The OutputStreamWriter used by StreamingUpdateSolrServer uses the platform default encoding. Hard-coding UTF-8 might be better, but maybe there are reasons not to.

Thanks,
Hugh

On May 28, 2010, at 3:02 PM, Tim Gilbert wrote:

> I had a similar problem a few days ago and I found that the documents were
> not being loaded correctly as UTF-8 into Solr. In my case, the loader
> program was a Java .jar I was executing from a cron job. There I added this:
>
> java -Dfile.encoding=UTF-8 -jar /home/tim/solr/bin/loadSiteSearch.jar
>
> Then, within that program, I wrote a function to take the strings I was loading
> and expressly declare them as UTF-8 like this:
>
> private String toUTF8(String value)
> {
>     return new String(value.getBytes(), "UTF-8");
> }
>
> and that solved the problem for me.
>
> Tim
>
> -----Original Message-----
> From: Hugh Cayless [mailto:philomou...@gmail.com]
> Sent: Friday, May 28, 2010 12:51 PM
> To: solr-user@lucene.apache.org
> Subject: SolrJ Unicode problem
>
> Hi, I'm a Solr newbie, and I'm hoping someone can point me in the right
> direction.
>
> I'm trying to index a bunch of documents with Greek text in them. I can
> successfully index documents by generating add XML and using curl to send
> them to my server, but when I use SolrJ to create and send documents, the
> encoding gets thoroughly messed up.
>
> Instead of the result (from an add doc posted with curl):
>
> <result name="response" numFound="1" start="0">
>   <doc>
>     <str name="id">c.etiq.mom;;2077</str>
>     <str name="transcription">Της Βησο ς Χρη εις Πανοπολίτης</str>
>   </doc>
> </result>
>
> I get (from a SolrInputDocument loaded with SolrJ):
>
> <result name="response" numFound="1" start="0">
>   <doc>
>     <str name="id">c.etiq.mom;;2077</str>
>     <str name="transcription">??? ???? ? ??? ??? ????�??????</str>
>   </doc>
> </result>
>
> I can confirm that the SolrInputDocument's transcription field contains Greek
> text before I call .add(documents) on the StreamingUpdateSolrServer (i.e., I
> can get Greek back out of it). So I don't know what to do next. Any ideas?
>
> Thanks,
> Hugh
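
P.S. For anyone who lands on this thread later, here is a minimal sketch of why the default encoding matters for an OutputStreamWriter. It is not SolrJ code: the class name EncodingDemo and the helper encode() are made up for illustration, and US-ASCII is only a stand-in for whatever non-UTF-8 default a given JVM might pick up. The Greek word is written with Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of SolrJ.
public class EncodingDemo {

    // Writes text through an OutputStreamWriter using the given charset
    // and returns the bytes that would go out on the wire.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // "Πανοπολίτης", spelled with escapes to avoid source-encoding issues.
        String greek = "\u03A0\u03B1\u03BD\u03BF\u03C0\u03BF\u03BB\u03AF\u03C4\u03B7\u03C2";

        // Assumption: US-ASCII plays the role of a non-UTF-8 platform default.
        // Characters it cannot represent are replaced with '?' on output.
        byte[] asciiBytes = encode(greek, StandardCharsets.US_ASCII);
        String lossy = new String(asciiBytes, StandardCharsets.US_ASCII);

        // With an explicit UTF-8 writer the text survives the round trip.
        byte[] utf8Bytes = encode(greek, StandardCharsets.UTF_8);
        String intact = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("ASCII round trip equals original: " + lossy.equals(greek));
        System.out.println("UTF-8 round trip equals original: " + intact.equals(greek));
    }
}
```

The first round trip loses the text and the second keeps it, which matches the question marks in the response above. Setting -Dfile.encoding only changes what the default is; passing the charset to the writer explicitly, as in encode(), removes the dependence on the default altogether.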