: We are running an instance of MediaWiki so the text goes through a
: couple of transformations: wiki markup -> html -> plain text.
: It's at this last step that I take a "snippet" and insert that into Solr.
...
: doc.addField("text_snippet_t", article.getSnippet(1000));
ok, well first off: that's not the field where you were having problems, is it? if i remember correctly from your previous posts, wasn't the response getting aborted in the middle of the Contents field?

: and a maximum of 1K chars if it's bigger. I initialized this String
: from the DB by using the String constructor where I pass in the
: charset/collation
:
: text = new String(textFromDB, "UTF-8");
:
: So to the best of my knowledge, accessing a substring of a UTF-8
: encoded string should not break up the UTF-8 code point. Is that an

i assume you are correct, but you listed several steps of transformation above... are you certain they all work correctly and produce valid UTF-8? this leads back to my suggestion before....

: > Can you put the original (pre solr, pre solrj, raw untouched, etc...)
: > file that this solr doc came from online somewhere?
: >
: > What does your *indexing* code look like? ... Can you add some debugging to
: > the SolrJ client when you *add* this doc to print out exactly what those
: > 1000 characters are?

-Hoss
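
p.s. to make that last suggestion concrete, here's a rough, untested
sketch of the kind of debugging i mean.  the only names taken from your
code are article.getSnippet(1000) and doc.addField(...), the SnippetDebug
class is just something i made up for illustration.  it prints the char
count vs. the code point count, flags any unpaired surrogate (a naive
substring at a fixed char index can split a surrogate pair in half, and
an unpaired surrogate has no valid UTF-8 encoding), and dumps the exact
bytes that will be sent as UTF-8...

    import java.nio.charset.StandardCharsets;

    public class SnippetDebug {

        // hypothetical helper: call it right before the addField line, ie:
        //   SnippetDebug.dump(article.getSnippet(1000));
        //   doc.addField("text_snippet_t", article.getSnippet(1000));
        public static void dump(String snippet) {

            // if these two counts differ, the snippet contains characters
            // outside the BMP, and truncating at a fixed char index could
            // have split one of them
            System.out.println("chars=" + snippet.length()
                + " codePoints=" + snippet.codePointCount(0, snippet.length()));

            // an unpaired surrogate can not be encoded as valid UTF-8,
            // so flag any we find
            for (int i = 0; i < snippet.length(); i++) {
                char c = snippet.charAt(i);
                boolean badHigh = Character.isHighSurrogate(c)
                    && (i + 1 == snippet.length()
                        || !Character.isLowSurrogate(snippet.charAt(i + 1)));
                boolean badLow = Character.isLowSurrogate(c)
                    && (i == 0
                        || !Character.isHighSurrogate(snippet.charAt(i - 1)));
                if (badHigh || badLow) {
                    System.out.println("unpaired surrogate at char " + i);
                }
            }

            // dump the exact bytes the client will send as UTF-8
            byte[] utf8 = snippet.getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02x ", b & 0xff));
            }
            System.out.println("utf8 byteCount=" + utf8.length);
            System.out.println("utf8 bytes: " + hex);
        }
    }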
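
p.p.s. re: "are you certain they all work correctly and produce valid
UTF-8" ... note that new String(textFromDB, "UTF-8") will never complain
about bad input, it typically just substitutes the replacement char
(U+FFFD) for any malformed byte sequence.  if you want the decode step to
fail loudly on bad bytes instead, a strict CharsetDecoder will do that.
again an untested sketch, StrictUtf8 is just a name i made up...

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class StrictUtf8 {

        // drop-in replacement for: text = new String(textFromDB, "UTF-8");
        // but throws CharacterCodingException instead of silently
        // replacing malformed input
        public static String strictDecode(byte[] textFromDB)
                throws CharacterCodingException {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(textFromDB))
                .toString();
        }
    }

if that throws anywhere in your pipeline, you've found the step that is
producing broken UTF-8 before the text ever reaches SolrJ.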