Re: questions about solrwriter

Cam Bazz Tue, 16 Aug 2011 12:10:52 -0700

Thank you, that is what I have done.
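
In outline, stripping every String value once, before the field mapping and the copy-field step, could look like the sketch below. It is a simplified stand-alone illustration, not the real SolrWriter code: the class StripBeforeMapping, the method toSolrFields and the plain Maps standing in for NutchDocument and the Solr mapping are all hypothetical, and the filter here is only an approximation of what Nutch's stripNonCharCodepoints does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the write() loop; not the actual Nutch SolrWriter. */
final class StripBeforeMapping {

  /**
   * Illustrative filter (an assumption, Nutch's own method may differ): drops
   * Unicode noncharacters, unpaired surrogates, and control characters other
   * than tab, LF and CR.
   */
  static String stripNonCharCodepoints(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      boolean nonChar = (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
      boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
      boolean control = cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r';
      if (!nonChar && !loneSurrogate && !control) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  /** Maps each field to its Solr name, stripping every String value exactly once. */
  static Map<String, List<Object>> toSolrFields(
      Map<String, List<Object>> doc,
      Map<String, String> keyMap,
      Map<String, String> copyMap) {
    Map<String, List<Object>> solrFields = new LinkedHashMap<>();
    for (Map.Entry<String, List<Object>> e : doc.entrySet()) {
      String key = e.getKey();
      for (Object val : e.getValue()) {
        // Strip before any mapping so the mapped field and its copy agree.
        Object clean = (val instanceof String) ? stripNonCharCodepoints((String) val) : val;

        String mapped = keyMap.getOrDefault(key, key);
        solrFields.computeIfAbsent(mapped, k -> new ArrayList<>()).add(clean);

        // equals() is used so the check does not rely on reference identity.
        String copy = copyMap.getOrDefault(key, key);
        if (!copy.equals(key)) {
          solrFields.computeIfAbsent(copy, k -> new ArrayList<>()).add(clean);
        }
      }
    }
    return solrFields;
  }

  public static void main(String[] args) {
    Map<String, List<Object>> doc = new LinkedHashMap<>();
    doc.put("e_features", Collections.<Object>singletonList("a\uFFFFb"));
    Map<String, String> copyMap = Collections.singletonMap("e_features", "features_copy");
    System.out.println(toSolrFields(doc, Collections.<String, String>emptyMap(), copyMap));
  }
}
```

Because the value is cleaned before either addField-style call, the copy field receives the same stripped value as the mapped field, and non-String values such as Dates pass through untouched.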


On Wed, Aug 10, 2011 at 4:06 PM, Markus Jelsma
<[email protected]> wrote:
> Hmmm, maybe we should just strip the codepoints on all fields. We're already
> doing it on content, which is by far the largest field; all other fields are
> tiny compared to this one. If we do it on all String fields, then this would
> also fix unknown fields added by custom plugins.
>
> The part you refer to is the Solr field mapping code. Strip codepoints
> before the mapping code, or you'll end up with one stripped and one not if you
> use copyFields in here.
>
>
> On Wednesday 10 August 2011 14:32:28 Cam Bazz wrote:
>> Hello,
>>
>> From SolrWriter.java:
>>
>>   public void write(NutchDocument doc) throws IOException {
>>
>>     final SolrInputDocument inputDoc = new SolrInputDocument();
>>
>>     for(final Entry<String, NutchField> e : doc) {
>>       for (final Object val : e.getValue().getValues()) {
>>
>>         // normalise the string representation for a Date
>>         Object val2 = val;
>>
>>         if (val instanceof Date){
>>           val2 = DateUtil.getThreadLocalDateFormat().format(val);
>>         }
>>
>>         if (e.getKey().equals("content") || e.getKey().equals("e_features")) {
>>           if (val != null) {
>>             val2 = stripNonCharCodepoints((String) val);
>>           }
>>         }
>>
>>         inputDoc.addField(solrMapping.mapKey(e.getKey()), val2,
>>             e.getValue().getWeight());
>>         String sCopy = solrMapping.mapCopyKey(e.getKey());
>>         if (sCopy != e.getKey()) {
>>           inputDoc.addField(sCopy, val);
>>         }
>>
>>       }
>>     }
>>     inputDoc.setDocumentBoost(doc.getWeight());
>>     inputDocs.add(inputDoc);
>>     if (inputDocs.size() >= commitSize) {
>>       try {
>>         LOG.info("Adding " + Integer.toString(inputDocs.size()) + " documents");
>>         solr.add(inputDocs);
>>       } catch (final SolrServerException e) {
>>         throw makeIOException(e);
>>       }
>>       inputDocs.clear();
>>     }
>>   }
>>
>>
>> What happens after inputDoc.addField(...)? I am getting an exception
>> while indexing e_features because of a UTF-8 encoding error. Previously
>> we patched this problem for content, and now I have another
>> field called e_features that I want to run through stripNonCharCodepoints
>> as well, but I don't understand why we are doing
>> if (sCopy != e.getKey()) { inputDoc.addField(sCopy, val); }
>>
>>
>> Best Regards,
>> C.B.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
