Re: index multiple files into one index entity

Erick Erickson Thu, 23 May 2013 05:11:42 -0700

I just skimmed your post, but I'm responding to the last bit.

If you have <uniqueKey> defined as "id" in schema.xml then
no, you cannot have multiple documents with the same ID.
Whenever a new doc comes in it replaces the old doc with that ID.
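
To make the replace-on-same-ID behavior concrete, here is a toy model in
plain Java. It is not Solr code: a map keyed by id stands in for the
index, and the class name UniqueKeyDemo is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only, NOT Solr code: a map keyed by "id" mimics how a
// <uniqueKey> field makes a later add replace an earlier one.
class UniqueKeyDemo {
    private final Map<String, String> docsById = new LinkedHashMap<>();

    // Adding a document whose id already exists replaces the old entry.
    void add(String id, String fileName) {
        docsById.put(id, fileName);
    }

    // Number of documents that survive in the toy index.
    int size() {
        return docsById.size();
    }

    // File name stored for the given id, or null if there is none.
    String get(String id) {
        return docsById.get(id);
    }
}
```

Adding several files under one id leaves a single entry holding the last
file, which matches the symptom you describe.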


You can remove the <uniqueKey> definition and do what you want,
but very few Solr installations run without a <uniqueKey>, so
it's probably a better idea to make your ids truly unique.
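
One possible way to get truly unique ids, sketched under assumptions:
build each file's id from the set id plus the file's own content id.
DocIds, perFileId and the "_" separator are hypothetical names chosen
for illustration, not part of SolrJ, and the separator assumes neither
part contains an underscore (true for UUIDs like the one in your log).

```java
// Hypothetical helper, not part of SolrJ: derives one unique document
// id per file from the id of the set and the id of the file.
final class DocIds {
    private static final String SEPARATOR = "_"; // assumed absent from both parts

    private DocIds() {
        // static helper, no instances
    }

    static String perFileId(String setId, String contentId) {
        if (setId == null || setId.isEmpty()) {
            throw new IllegalArgumentException("setId must not be empty");
        }
        if (contentId == null || contentId.isEmpty()) {
            throw new IllegalArgumentException("contentId must not be empty");
        }
        return setId + SEPARATOR + contentId;
    }
}
```

As far as I know, literal.* parameters apply to the whole request, so
this would likely mean sending one extract request per file, and storing
the set id in a field of its own if you need to find the files together.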

Best
Erick

On Thu, May 23, 2013 at 6:14 AM,  <mark.ka...@t-systems.com> wrote:
> Hello solr team,
>
> I want to index multiple files into one Solr index entity, with the same id.
> We are using Solr 4.1.
>
>
> I tried it with the following source fragment:
>
>     public void addContentSet(ContentSet contentSet) throws SearchProviderException {
>
>         ...
>
>         ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
>         String indexId = contentSet.getIndexId();
>
>         ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
>         server.request(csur);
>
>         ...
>     }
>
>     private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
>             throws IOException {
>         ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());
>
>         ModifiableSolrParams parameters = csur.getParams();
>         if (parameters == null) {
>             parameters = new ModifiableSolrParams();
>         }
>
>         parameters.set("literalsOverride", "false");
>
>         // map the Tika default content attribute to the attribute named 'fulltext'
>         parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
>         // create an empty content stream; this seems necessary for ContentStreamUpdateRequest
>         csur.addContentStream(new ImaContentStream());
>
>         for (Content content : contentSet.getContentList()) {
>             csur.addContentStream(new ImaContentStream(content));
>             // for each content stream add additional attributes
>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(),
>                     content.getBinaryObjectId().toString());
>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(),
>                     content.getContentKey());
>             parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(),
>                     content.getContentName());
>             parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(),
>                     content.getMimeType());
>         }
>
>         parameters.set("literal.id", indexId);
>
>         // adding some other attributes
>         ...
>
>         csur.setParams(parameters);
>
>         return csur;
>     }
>
> During debugging I can see that 'server.request(csur)' reads the buffer of
> each ImaContentStream.
> When I look at the Solr catalina log, I can see that the attached files reach
> the Solr servlet.
>
> INFO: Releasing directory:/data/V-4-1/master0/data/index
> Apr 25, 2013 5:48:07 AM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: [master0] webapp=/solr-4-1 path=/update/extract 
> params={literal.searchconnectortest15_c8150e41_cc49_4a ...... 
> &literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1& .....
> {add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), 
> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424), 
> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304), 
> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336), 
> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216), 
> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58
>
>
> But only the last file in the content list ends up in the index.
>
>
> My schema.xml has the following field definitions:
>
>     <field name="id" type="string" indexed="true" stored="true" required="true"/>
>     <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
>
>     <field name="contentkey" type="string" indexed="true" stored="true" multiValued="true"/>
>     <field name="contentid" type="string" indexed="true" stored="true" multiValued="true"/>
>     <field name="contentfilename" type="string" indexed="true" stored="true" multiValued="true"/>
>     <field name="contentmimetype" type="string" indexed="true" stored="true" multiValued="true"/>
>
>     <field name="fulltext" type="text_general" indexed="true" stored="true" multiValued="true"/>
>
>
> I'm using the Tika-based ExtractingRequestHandler, which can extract text from binary files.
>
>
>
>   <requestHandler name="/update/extract"
>                   startup="lazy"
>                   class="solr.extraction.ExtractingRequestHandler" >
>     <lst name="defaults">
>       <str name="lowernames">true</str>
>       <str name="uprefix">ignored_</str>
>
>       <!-- capture link hrefs but ignore div attributes -->
>       <str name="captureAttr">true</str>
>       <str name="fmap.a">links</str>
>       <str name="fmap.div">ignored_</str>
>
>     </lst>
>   </requestHandler>
>
> Is it possible to index multiple files with the same id?
> Is it necessary to implement my own RequestHandler?
>
> With best regards Mark
>
>
>
