index multiple files into one index entity

Mark.Kappe Thu, 23 May 2013 03:17:43 -0700

Hello solr team,

I want to index multiple files into one Solr index entity, all with the same id.
We are using Solr 4.1.



I tried it with the following source fragment:

    public void addContentSet(ContentSet contentSet) throws SearchProviderException {

        ...

        ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
        String indexId = contentSet.getIndexId();

        ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
        server.request(csur);

        ...
    }

    private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
            throws IOException {
        ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());

        ModifiableSolrParams parameters = csur.getParams();
        if (parameters == null) {
            parameters = new ModifiableSolrParams();
        }

        parameters.set("literalsOverride", "false");

        // maps the Tika default content attribute to the attribute with name 'fulltext'
        parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
        // create an empty content stream, this seems necessary for ContentStreamUpdateRequest
        csur.addContentStream(new ImaContentStream());

        for (Content content : contentSet.getContentList()) {
            csur.addContentStream(new ImaContentStream(content));
            // for each content stream add additional attributes
            parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(),
                    content.getBinaryObjectId().toString());
            parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(),
                    content.getContentKey());
            parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(),
                    content.getContentName());
            parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(),
                    content.getMimeType());
        }

        parameters.set("literal.id", indexId);

        // adding some other attributes
        ...

        csur.setParams(parameters);

        return csur;
    }
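
To spell out what I expect from the loop above: as far as I understand ModifiableSolrParams, add() appends another value under the same parameter name, while set() replaces whatever was stored before. The following is only a minimal sketch of that behaviour in plain Java. The class MultiValueParamsSketch and the parameter name "literal.contentfilename" are hypothetical stand-ins for illustration, not the SolrJ class or my real attribute names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued parameter map; this is NOT the SolrJ class.
class MultiValueParamsSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // set: replace whatever was stored under the name with a single value
    void set(String name, String value) {
        values.put(name, new ArrayList<>(Arrays.asList(value)));
    }

    // add: append to the values already stored under the name
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // returns the stored values, or an empty list if the name is unknown
    List<String> get(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        MultiValueParamsSketch p = new MultiValueParamsSketch();
        // one add() per file, as in the loop over the content list
        p.add("literal.contentfilename", "a.pdf");
        p.add("literal.contentfilename", "b.pdf");
        // set() keeps only the most recent value
        p.set("literal.id", "doc-0");
        p.set("literal.id", "doc-1");
        System.out.println(p.get("literal.contentfilename"));
        System.out.println(p.get("literal.id"));
    }
}
```

So after the loop every literal.* parameter should carry one value per file, and all of them are sent once with the request rather than attached to an individual content stream.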

While debugging I can see that 'server.request(csur)' reads the buffer of each 
ImaContentStream.
In the Solr catalina log I can see that the attached files reach the 
Solr servlet:

INFO: Releasing directory:/data/V-4-1/master0/data/index
Apr 25, 2013 5:48:07 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [master0] webapp=/solr-4-1 path=/update/extract 
params={literal.searchconnectortest15_c8150e41_cc49_4a ...... 
&literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1& .....
{add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58


But only the last file in the content list ends up in the index.
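
My assumption, which I have not confirmed in the Solr source, is that every content stream is turned into its own add with the same id (the log above shows six adds), and that a later add replaces the earlier document with that id. A minimal sketch of that overwrite-by-key behaviour in plain Java follows. OverwriteByIdSketch is a hypothetical toy index with no Solr involved, and the texts are placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index: one stored document per unique key, no Solr involved.
class OverwriteByIdSketch {
    private final Map<String, String> docsById = new LinkedHashMap<>();
    private int addCount = 0;

    // Mimics an add with overwrite semantics:
    // a later add with the same id replaces the earlier document.
    void add(String id, String extractedText) {
        addCount++;
        docsById.put(id, extractedText);
    }

    int addCount() { return addCount; }

    int size() { return docsById.size(); }

    String get(String id) { return docsById.get(id); }

    public static void main(String[] args) {
        OverwriteByIdSketch index = new OverwriteByIdSketch();
        String id = "26afa5dc-40ad-442a-ac79-0e7880c06aa1";
        // placeholder texts standing in for the extracted content of each stream
        List<String> streams = Arrays.asList("text of file 1", "text of file 2", "text of file 3");
        for (String text : streams) {
            index.add(id, text);
        }
        System.out.println(index.addCount() + " adds, " + index.size() + " document(s) stored");
        System.out.println(index.get(id));
    }
}
```

If that assumption holds, it would explain why all files reach the servlet but only the last one is searchable.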


My schema.xml has the following field definitions:

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

    <field name="contentkey" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentid" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentfilename" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentmimetype" type="string" indexed="true" stored="true" multiValued="true"/>

    <field name="fulltext" type="text_general" indexed="true" stored="true" multiValued="true"/>


I'm using the Tika-based ExtractingRequestHandler, which can extract content from binary files.



  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>

    </lst>
  </requestHandler>

Is it possible to index multiple files with the same id?
Is it necessary to implement my own RequestHandler?

With best regards,
Mark
