Hi,

I have Solr 5.5.0 configured with UIMA and Tika. I am running into issues
when I perform atomic updates on documents that are already indexed. My
configuration is below.
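
For reference, an atomic update in Solr's XML update format looks roughly
like the request below. This is only a sketch: it assumes the uniqueKey
field is named 'id', and the id value 'doc1' and the new title text are
placeholders, not real data from this index.

<add>
  <doc>
    <!-- assumption: 'id' is the uniqueKey; 'doc1' is a placeholder value -->
    <field name="id">doc1</field>
    <!-- update="set" replaces the stored value of 'title' for this document -->
    <field name="title" update="set">Updated title</field>
  </doc>
</add>

A request like this, sent to /update, goes through the 'uima' chain shown
below, since that chain is the default for the handler.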

<updateRequestProcessorChain name="uima" default="true">
    <processor
class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
          <int name="ngramsize">3</int>
        </lst>
        <!-- analysisEngine must contain an AE descriptor inside the
specified path in the classpath.  -->
        <str name="analysisEngine"><Path to my Analysis Engine></str>
        <!-- Set to true if you want to continue indexing even if text
processing fails.
             Default is false. That is, Solr throws RuntimeException and
             never indexed documents entirely in your session. -->
        <bool name="ignoreErrors">true</bool>
        <!-- This is optional. It is used for logging when text processing
fails.
             If logField is not specified, uniqueKey will be used as
logField.
        <str name="logField">id</str>
        -->
        <!-- analyzeFields must contain the input fields that need to be
analyzed by UIMA. -->
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <!-- wanted to use field 'text' but solr-uima has known bug for
'multiValued' types field parsing; hence using multiple fields -->
          <arr name="fields">
            <str>content</str>
            <str>title</str>
          </arr>
        </lst>
        <!-- Field mapping describes which features of which types should
go in a field. -->
        <lst name="fieldMappings">
          <lst name="type">
            <str name="name">org.apache.uima.TokenAnnotation</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">posVals</str>
            </lst>
            <lst name="mapping">
              <str name="feature">posTag</str>
              <str name="field">posTags</str>
            </lst>
          </lst>
         </lst>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


<requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">uima</str>
    </lst>
  </requestHandler>

// Tika configuration

<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <!--  ignore undeclared fields -->
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
    </lst>

    <!-- Optional: Specify a path to a tika configuration file. See the
Tika docs for details. -->
    <!--  <str name="tika.config">/my/path/to/tika.config</str> -->

    <!-- Optional: Specify one or more date formats to parse. See
DateUtil.DEFAULT_DATE_FORMATS for default date formats -->
    <lst name="date.formats">
      <str>yyyy-MM-dd</str>
    </lst>
  </requestHandler>


My schema has the fields 'title' and 'content', which are analyzed by UIMA
and copied to 'text' using copyField.

<field name="title" type="text_general" indexed="true" stored="true"
multiValued="true" termVectors="true" termPositions="true"
termOffsets="true" />
<field name="content" type="text_general" stored="true" multiValued="true"
termVectors="true" termPositions="true" termOffsets="true" />
<field name="text" type="text_general" indexed="true" stored="true"
multiValued="true" termVectors="true" termPositions="true"
termOffsets="true" />

<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>

I tried removing stored="true" from the 'text' field, but that did not help.

The issue at https://issues.apache.org/jira/browse/SOLR-8528 is marked as
fixed, but I am still facing the problem.
Can someone please help me with this?

Thanks,
Srini

-- 
http://cheyuta-helpinghands.blogspot.com
