Hmm, maybe we should just strip the codepoints on all fields. We're already
doing it on content, which is by far the largest field; all other fields are
tiny by comparison. If we do it on all String fields, this would also fix
unknown fields added by custom plugins.

The part you refer to is the Solr field mapping code: when a copy mapping
exists for a field, it adds the value a second time under the copy field name.
Strip the codepoints before the mapping code, or you'll end up with one
stripped value and one unstripped value if you use copyFields here.
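
A rough sketch of what I mean, not a patch against SolrWriter. It uses plain
maps in place of NutchDocument, SolrInputDocument and the Solr mapping, so the
class name, toFields() and the copyKeys map are made up for illustration, and
stripNonCharCodepoints() here is my own stand-in, not necessarily what the
existing helper does. The point is only the order: clean every String value
first, then hand the same cleaned value to both the mapped field and the copy
field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StripBeforeMapping {

    // Stand-in (assumption): drops Unicode noncharacters, i.e. U+FDD0..U+FDEF
    // and every code point whose low 16 bits are FFFE or FFFF.
    static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF)
                    || (cp & 0xFFFE) == 0xFFFE;
            if (!nonChar) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Clean any String value; Dates, numbers and nulls pass through untouched.
    static Object clean(Object val) {
        return (val instanceof String) ? stripNonCharCodepoints((String) val) : val;
    }

    // Hypothetical replacement for the loop in write(): strip first, map second.
    static Map<String, List<Object>> toFields(Map<String, List<Object>> doc,
                                              Map<String, String> copyKeys) {
        Map<String, List<Object>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> e : doc.entrySet()) {
            for (Object val : e.getValue()) {
                Object cleaned = clean(val);
                out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(cleaned);
                String copy = copyKeys.get(e.getKey());
                if (copy != null && !copy.equals(e.getKey())) {
                    // The copy field receives the cleaned value as well.
                    out.computeIfAbsent(copy, k -> new ArrayList<>()).add(cleaned);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new LinkedHashMap<>();
        doc.put("e_features", Arrays.asList((Object) "a\uFFFFb"));
        Map<String, String> copyKeys = new HashMap<>();
        copyKeys.put("e_features", "features_copy");
        System.out.println(toFields(doc, copyKeys));
    }
}
```

With this ordering there is no per-field check on "content" or "e_features"
any more, so fields added by plugins are covered without touching the writer
again.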
On Wednesday 10 August 2011 14:32:28 Cam Bazz wrote:
> Hello,
>
> From SolrWriter.java:
>
> public void write(NutchDocument doc) throws IOException {
>   final SolrInputDocument inputDoc = new SolrInputDocument();
>
>   for (final Entry<String, NutchField> e : doc) {
>     for (final Object val : e.getValue().getValues()) {
>
>       // normalise the string representation for a Date
>       Object val2 = val;
>
>       if (val instanceof Date) {
>         val2 = DateUtil.getThreadLocalDateFormat().format(val);
>       }
>
>       if (e.getKey().equals("content") || e.getKey().equals("e_features")) {
>         if (val != null) {
>           val2 = stripNonCharCodepoints((String) val);
>         }
>       }
>
>       inputDoc.addField(solrMapping.mapKey(e.getKey()), val2,
>           e.getValue().getWeight());
>       String sCopy = solrMapping.mapCopyKey(e.getKey());
>       if (sCopy != e.getKey()) {
>         inputDoc.addField(sCopy, val);
>       }
>     }
>   }
>
>   inputDoc.setDocumentBoost(doc.getWeight());
>   inputDocs.add(inputDoc);
>   if (inputDocs.size() >= commitSize) {
>     try {
>       LOG.info("Adding " + Integer.toString(inputDocs.size()) + " documents");
>       solr.add(inputDocs);
>     } catch (final SolrServerException e) {
>       throw makeIOException(e);
>     }
>     inputDocs.clear();
>   }
> }
>
>
> What is happening after inputDoc.addField...? I am getting an exception
> while indexing e_features because of a UTF-8 encoding error. Previously
> we patched this problem for content, and now I have another
> field called e_features that I wanted to run through stripNonCharCodepoints
> as well, but I don't understand why we are doing the if (sCopy !=
> e.getKey()) { inputDoc.addField(sCopy, val);}
>
>
> Best Regards,
> C.B.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350