: We are running an instance of MediaWiki so the text goes through a
: couple of transformations: wiki markup -> html -> plain text.
: It's at this last step that I take a "snippet" and insert it into Solr.
        ...
: doc.addField("text_snippet_t", article.getSnippet(1000));

ok, well first off: that's not the field where you were having problems, 
is it?  if i remember correctly from your previous posts, wasn't the 
response getting aborted in the middle of the Contents field?

: and a maximum of 1K chars if it's bigger. I initialized this String
: from the DB by using the String constructor where I pass in the
: charset/collation
: 
: text = new String(textFromDB, "UTF-8");
: 
: So to the best of my knowledge, accessing a substring of a UTF-8
: encoded string should not break up the UTF-8 code point. Is that an

i assume you are correct, but you listed several steps of transformation 
above; are you certain they all work correctly and produce valid UTF-8?
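
for example: a java String is UTF-16 internally, so if getSnippet() is 
built on a plain substring(0, 1000) it can cut a surrogate pair in half, 
and the orphaned half becomes invalid bytes the moment that string gets 
encoded as UTF-8 again.  (also note: new String(bytes, "UTF-8") doesn't 
throw on malformed input, it quietly substitutes U+FFFD, so bad bytes from 
the DB would sail right through that constructor.)  i'm only guessing at 
what getSnippet does since you didn't post it, but a truncation that backs 
off to a code point boundary would look something like this sketch...

  // sketch only: truncate on a code point boundary instead of a raw
  // char index (safeSnippet/maxChars are my names, not your code)
  static String safeSnippet(String text, int maxChars) {
    if (text.length() <= maxChars) return text;
    int end = maxChars;
    // don't cut between a high surrogate and its low surrogate
    if (Character.isHighSurrogate(text.charAt(end - 1))) {
      end--;
    }
    return text.substring(0, end);
  }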

this leads back to my suggestion before....

: > Can you put the original (pre solr, pre solrj, raw untouched, etc...)
: > file that this solr doc came from online somewhere?
: >
: > What does your *indexing* code look like? ... Can you add some debugging to
: > the SolrJ client when you *add* this doc to print out exactly what those
: > 1000 characters are?
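
to make that concrete, here's a rough sketch of the kind of debugging i 
mean (the method name is mine, everything else is a guess about your 
code): dump the non-ascii chars and the UTF-8 byte count of the snippet 
right before you add it...

  import java.nio.charset.Charset;

  // sketch: print anything in the snippet likely to get mangled, plus
  // the encoded byte count, so you can compare against the source text
  static void debugSnippet(String snippet) {
    byte[] utf8 = snippet.getBytes(Charset.forName("UTF-8"));
    System.out.println("chars=" + snippet.length()
        + " utf8bytes=" + utf8.length);
    for (int i = 0; i < snippet.length(); i++) {
      char c = snippet.charAt(i);
      if (c > 127) {  // non-ascii, including any stray surrogates
        System.out.println("  char " + i + " = U+"
            + Integer.toHexString(c).toUpperCase());
      }
    }
  }

call that right before doc.addField("text_snippet_t", ...) and diff the 
output against the original wiki text.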


-Hoss
