Re: index multiple files into one index entity
You did not open source it by any chance? :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Sun, May 26, 2013 at 8:23 PM, Yury Kats yuryk...@yahoo.com wrote:

That's exactly what happens. Each stream goes into a separate document. If all streams share the same unique id parameter, the last stream will overwrite everything. I asked this same question last year, got no responses, and ended up writing my own UpdateRequestProcessor. See http://tinyurl.com/phhqsb4

On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote:

If I understand correctly, the issue is:
1) The client provides multiple content streams and expects Tika to parse all of them and stick all the extracted content into one big SolrDoc.
2) Tika (looking at the load() method of ExtractingDocumentLoader.java; GitHub link: http://bit.ly/12GsDl9 ) does not actually suspect that its load() method may be called multiple times, and therefore happily submits the document at the end of that call. It probably submits a new document for each content source, which means it just overwrites the same doc over and over again.

If I am right, then we have a bug in the Tika handler's expectations (of a single load() call). The next step would be to put together a very simple use case and open a Jira issue with it.

P.S. I am not a Solr code wrangler, so this MAY be completely wrong.

On Sun, May 26, 2013 at 10:46 AM, Erick Erickson erickerick...@gmail.com wrote: [...]
Re: index multiple files into one index entity
No, the implementation was very specific to my needs. On 5/27/2013 8:28 AM, Alexandre Rafalovitch wrote: You did not open source it by any chance? :-) Regards, Alex.
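Yury's UpdateRequestProcessor is not public, but the idea behind such a processor can be sketched in plain Java: instead of letting each extracted file's document replace the previous one, collect the per-file field values and fold them into a single document whose fields become multivalued. Everything below (the `MergingDoc` name, the field names) is a hypothetical illustration of the concept, not Yury's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Conceptual sketch of what a merging UpdateRequestProcessor does:
 * several per-file documents sharing one id are folded into a single
 * document whose fields become multivalued. Names are hypothetical.
 */
public class MergingDoc {

    /** Merge per-file field maps into one map of multivalued fields. */
    static Map<String, List<String>> merge(List<Map<String, String>> perFileDocs) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> doc : perFileDocs) {
            for (Map.Entry<String, String> e : doc.entrySet()) {
                // one list per field name, appending each file's value in turn
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> perFile = new ArrayList<>();
        perFile.add(Map.of("contentfilename", "a.pdf"));
        perFile.add(Map.of("contentfilename", "b.pdf"));
        // both filenames survive in one multivalued field
        System.out.println(merge(perFile).get("contentfilename"));
    }
}
```

In a real processor the merging would happen inside processAdd() against SolrInputDocument instances, but the folding logic is the same.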
Re: index multiple files into one index entity
I'm still not quite getting the issue. Separate requests (i.e. any addition of a SolrInputDocument) are treated as separate documents. There's no notion of appending the contents of one doc to another based on ID, unless you're doing atomic updates. And Tika takes some care to index separate files as separate documents.

Now, if you don't need these to share the same uniqueKey, you might index them as separate documents and include a field that lets you associate these documents somehow (see the grouping/field collapsing wiki page). But otherwise, I think I need a higher-level view of what you're trying to accomplish to make an intelligent comment.

Best
Erick

On Thu, May 23, 2013 at 9:05 AM, mark.ka...@t-systems.com wrote: [...]
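Erick's "atomic updates" suggestion can be made concrete. Solr 4.1 supports atomic updates (assuming the schema stores the relevant fields and the update log is enabled): a JSON update whose field value is wrapped in an "add" operation appends to an existing multivalued field instead of replacing the document. The sketch below only builds the request body with the JDK, so it stays runnable without SolrJ; the id and field name are taken from Mark's log and schema, and POSTing to /update is left out.

```java
/**
 * Builds the JSON body for a Solr 4.x atomic update that APPENDS a value
 * to a multivalued field instead of replacing the whole document. The body
 * would be POSTed to /update with Content-Type: application/json.
 */
public class AtomicUpdateBody {

    static String appendBody(String id, String field, String value) {
        // {"add": "..."} tells Solr to add the value to the existing field
        return String.format(
            "[{\"id\":\"%s\",\"%s\":{\"add\":\"%s\"}}]",
            id, field, value);
    }

    public static void main(String[] args) {
        System.out.println(appendBody(
            "26afa5dc-40ad-442a-ac79-0e7880c06aa1",
            "contentfilename", "b.pdf"));
    }
}
```

Note that atomic updates require all fields to be stored (or copyField targets), since Solr reconstructs the document from stored values when applying the update.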
index multiple files into one index entity
Hello Solr team,

I want to index multiple files into one Solr index entity, with the same id. We are using Solr 4.1. I tried it with the following source fragment:

    public void addContentSet(ContentSet contentSet) throws SearchProviderException {
        ...
        ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
        String indexId = contentSet.getIndexId();
        ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
        server.request(csur);
        ...
    }

    private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet) throws IOException {
        ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());
        ModifiableSolrParams parameters = csur.getParams();
        if (parameters == null) {
            parameters = new ModifiableSolrParams();
        }
        parameters.set("literalsOverride", false);
        // maps the Tika default content attribute to the attribute named 'fulltext'
        parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
        // create an empty content stream; this seems necessary for ContentStreamUpdateRequest
        csur.addContentStream(new ImaContentStream());
        for (Content content : contentSet.getContentList()) {
            csur.addContentStream(new ImaContentStream(content));
            // for each content stream, add additional attributes
            parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(), content.getBinaryObjectId().toString());
            parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(), content.getContentKey());
            parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(), content.getContentName());
            parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(), content.getMimeType());
        }
        parameters.set("literal.id", indexId);
        // adding some other attributes ...
        csur.setParams(parameters);
        return csur;
    }

During debugging I can see that the method server.request(csur) reads the buffer for each ImaContentStream. When I look at the Solr Catalina log, I see that the attached files reach the Solr servlet:

    INFO: Releasing directory:/data/V-4-1/master0/data/index
    Apr 25, 2013 5:48:07 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: [master0] webapp=/solr-4-1 path=/update/extract params={literal.searchconnectortest15_c8150e41_cc49_4a .. literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1 . {add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58

But only the last file in the content list gets indexed. My schema.xml has the following field definitions:

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
    <field name="contentkey" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentid" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentfilename" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentmimetype" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fulltext" type="text_general" indexed="true" stored="true" multiValued="true"/>

I'm using the Tika ExtractingRequestHandler, which can extract binary files:

    <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

Is it possible to index multiple files with the same id? Is it necessary to implement my own RequestHandler?

With best regards,
Mark
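The log already hints at the root cause: six separate adds arrive, all carrying the same literal.id, and because id is the uniqueKey each add replaces the one before it. A minimal model of that replace semantics, using the JDK only and hypothetical names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Models Solr's uniqueKey semantics: every add with an existing id
 * REPLACES the stored document. Posting one extract request with several
 * content streams that all carry the same literal.id therefore leaves
 * only the last stream's document in the index.
 */
public class LastWriteWins {

    static String lastSurvivor(String id, List<String> fulltexts) {
        Map<String, String> index = new HashMap<>(); // id -> extracted fulltext
        for (String fulltext : fulltexts) {
            index.put(id, fulltext); // each stream = one add = one replace
        }
        return index.get(id);
    }

    public static void main(String[] args) {
        String id = "26afa5dc-40ad-442a-ac79-0e7880c06aa1";
        // only the last stream's content survives
        System.out.println(lastSurvivor(id, List.of("text of file 1", "text of file 2", "text of file 3")));
    }
}
```

This is why the log shows six adds with the same id but only one document's content is retrievable afterwards.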
Re: index multiple files into one index entity
I just skimmed your post, but I'm responding to the last bit. If you have uniqueKey defined as id in schema.xml then no, you cannot have multiple documents with the same ID. Whenever a new doc comes in, it replaces the old doc with that ID.

You can remove the uniqueKey definition and do what you want, but there are very few Solr installations with no uniqueKey, and it's probably a better idea to make your ids truly unique.

Best
Erick

On Thu, May 23, 2013 at 6:14 AM, mark.ka...@t-systems.com wrote: [...]
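Erick's later suggestion in this thread (separate documents plus grouping) can be made concrete: give each file its own truly unique id, store a shared set id in a separate field, and collapse at query time with the result-grouping parameters (group=true&group.field=...). The sketch below assembles the per-file documents and the query string with the JDK only; contentsetid is a hypothetical field name, not something from Mark's schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the "separate documents + grouping" approach: each file is its
 * own document with a unique id, and a shared contentsetid ties the set
 * together so queries can collapse on it.
 */
public class GroupedDocs {

    static List<Map<String, String>> docsForSet(String setId, List<String> fileNames) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String name : fileNames) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", setId + "/" + name);  // truly unique per file
            doc.put("contentsetid", setId);     // shared across the set
            doc.put("contentfilename", name);
            docs.add(doc);
        }
        return docs;
    }

    /** Query-time side: collapse results so each content set appears once. */
    static String groupQuery(String queryText) {
        return "/select?q=" + queryText + "&group=true&group.field=contentsetid";
    }

    public static void main(String[] args) {
        System.out.println(docsForSet("set-1", List.of("a.pdf", "b.pdf")));
        System.out.println(groupQuery("fulltext:solr"));
    }
}
```

This keeps uniqueKey semantics intact while still letting a search return one hit per content set.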
AW: index multiple files into one index entity
Hello Erick,

Thank you for your fast answer. Maybe I didn't explain my question clearly: I want to index many files into one index entity. I want the same behavior as for any other multivalued field, which can be indexed under one unique id. So I think every ContentStreamUpdateRequest represents one index entity, doesn't it? And with each addContentStream I add one file to this entity.

Thank you and best regards,
Mark

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, 23 May 2013 14:11
To: solr-user@lucene.apache.org
Subject: Re: index multiple files into one index entity

[...]