Hi,

I've been working on adding Solr integration to my current project, but 
have run into a problem with non-ASCII characters.

I send a document like the following:

---
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
  <field name="question_id">228</field>
  <field name="question_title">Vedhæft billede til min formular</field>
  <field name="userid">26</field>
  <field name="question_text">Jeg har lavet en side som skal info om 
værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen - 
dvs nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om 
deres håndværk udført på stedet. 
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</field>
  <field name="question_date">2006-05-17T08:44:23Z</field>
  <field name="question_tags">Upload</field>
  <field name="question_tags">HTML</field>
  <field name="question_tags">Email</field>
  <field name="question_tags">Vedhæftning</field>
</doc></add>
---

But when I run a search like "/solr/select/?q=billede" (the default search 
field is "text", a multiValued copyField fed from question_title and 
question_text), I get the document back as:

---
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 ...
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <date name="question_date">2006-05-17T08:44:23Z</date>
  <int name="question_id">228</int>
  <arr name="question_tags"><str>Upload</str><str>HTML</str><str>Email</str>
        <str>Vedhæftning</str></arr>
  <str name="question_text">Jeg har lavet en side som skal info om værkstedet 
Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs 
nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om 
deres håndværk udført på stedet. 
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</str>
  <str name="question_title">Vedhæft billede til min formular</str>
  <str name="userid">26</str>
 </doc>
</result>
</response>
---

This is basically the same text, but displayed as ISO-8859-1. How can this be? 
Do I have to send some header saying the data is UTF-8, or should I just send 
the data as UTF-8 (that produces the correct encoding in the answers, but 
sounds like a silly way of doing it)?
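
To make the "header" option concrete, here is a minimal Python sketch of what I mean. It only builds the POST request (nothing is sent) with the body encoded as UTF-8 and the charset stated explicitly in the Content-Type header. The update URL and the helper names are my own assumptions for illustration, not something taken from the Solr docs, and I have not confirmed that a missing charset is the actual cause here. The second helper just shows what UTF-8 bytes look like when something along the way decodes them as ISO-8859-1.

```python
import urllib.request

# Hypothetical update URL; adjust host, port and path to the actual install.
SOLR_UPDATE_URL = "http://localhost:8080/solr/update"


def build_update_request(xml_text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST whose body is UTF-8 and says so."""
    body = xml_text.encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # State the charset explicitly so the server does not have to guess.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def misread_as_latin1(text):
    """Return what UTF-8 encoded text looks like when decoded as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


doc = (
    '<add><doc><field name="question_title">'
    "Vedhæft billede til min formular"
    "</field></doc></add>"
)
req = build_update_request(doc)

print(req.get_header("Content-type"))  # text/xml; charset=utf-8
print(misread_as_latin1("Vedhæft"))    # VedhÃ¦ft
```

If the garbled output resembles the second print, then the bytes are probably being stored correctly and only decoded with the wrong charset somewhere between the client, the servlet container and Solr.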

Any ideas?

Btw, the install script listed at http://wiki.apache.org/solr/SolrTomcat is a 
bit wrong. Should I just contribute the fixes (new Solr dir and name to 
fetch) to the wiki, or would one of you rather do it yourselves?

Regards
 -fangel
