Re: index multiple files into one index entity

2013-05-27 Thread Alexandre Rafalovitch
You did not open source it by any chance? :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, May 26, 2013 at 8:23 PM, Yury Kats yuryk...@yahoo.com wrote:
 That's exactly what happens. Each stream goes into a separate document.
 If all streams share the same unique id parameter, the last stream
 will overwrite everything.

 I asked this same question last year, got no responses, and ended up
 writing my own UpdateRequestProcessor.

 See http://tinyurl.com/phhqsb4

 On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote:
 If I understand correctly, the issue is:
 1) The client provides multiple content streams and expects Tika to
 parse all of them and stick all the extracted content into one big
 SolrDoc.
 2) Tika (looking at the load() method of ExtractingDocumentLoader.java;
 GitHub link: http://bit.ly/12GsDl9 ) does not actually suspect that
 its load() method may be called multiple times, and therefore happily
 submits the document at the end of that call. It probably submits a new
 document for each content source, which probably means it just
 overwrites the same doc over and over again.

 If I am right, then we have a bug in the Tika handler's expectations (of
 a single load() call). The next step would be to put together a very
 simple use case and open a JIRA issue with it.

 Regards,
Alex.
 P.S. I am not a Solr code wrangler, so this MAY be completely wrong.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
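
Alexandre suggests putting together a very simple use case; below is a minimal
SolrJ sketch of one, for the failure mode described above: two content streams
in a single extract request, both bound to the same literal.id. Against a
stock Solr 4.x /update/extract handler, the expectation is that each stream
becomes its own add, so only the last stream's content survives. The URL,
core, and field names are illustrative, not taken from this thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class MultiStreamRepro {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        // two streams in one request, both mapped onto the same unique id
        req.addContentStream(new ContentStreamBase.StringStream("content of file one"));
        req.addContentStream(new ContentStreamBase.StringStream("content of file two"));
        req.setParam("literal.id", "entity-1");
        req.setParam("fmap.content", "fulltext");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        server.request(req);

        // if each stream is submitted as a separate document, this prints a
        // single document whose fulltext comes only from "file two"
        System.out.println(server.query(new SolrQuery("id:entity-1")).getResults());
    }
}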



Re: index multiple files into one index entity

2013-05-27 Thread Yury Kats
No, the implementation was very specific to my needs.

On 5/27/2013 8:28 AM, Alexandre Rafalovitch wrote:
 You did not open source it by any chance? :-)
 
 Regards,
Alex.
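
Yury's processor isn't public, but one plausible shape for it is sketched
below: an UpdateRequestProcessor that rewrites every incoming field (except
the uniqueKey) into an atomic-update "add" operation, so that successive
documents with the same id merge into the stored document instead of replacing
it. This is a guess at the approach, not Yury's code; the class name is
invented, all affected fields must be stored, and updateLog must be enabled in
solrconfig.xml for atomic updates to work.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class MergeSameIdProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                // wrap every plain value as {"add": values} so Solr treats the
                // document as an atomic update that appends to the stored doc
                for (String name : new ArrayList<String>(doc.getFieldNames())) {
                    if ("id".equals(name)) {
                        continue; // leave the uniqueKey untouched
                    }
                    doc.setField(name,
                            Collections.singletonMap("add", doc.getFieldValues(name)));
                }
                super.processAdd(cmd);
            }
        };
    }
}

Registered in an updateRequestProcessorChain ahead of
RunUpdateProcessorFactory and attached to /update/extract via update.chain,
something like this would fold each extracted file into the existing document.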



Re: index multiple files into one index entity

2013-05-26 Thread Erick Erickson
I'm still not quite getting the issue. Separate requests (i.e. any
addition of a SolrInputDocument) are treated as separate documents.
There's no notion of appending the contents of one doc to another based
on ID, unless you're doing atomic updates.

And Tika takes some care to index separate files as separate documents.

Now, if you don't need these to share the same uniqueKey, you might
index them as separate documents and include a field that lets you
associate them somehow (see the group/field collapsing Wiki page).

But otherwise, I think I need a higher-level view of what you're
trying to accomplish to make an intelligent comment.

Best
Erick
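
To make Erick's grouping route concrete, a sketch: give each file its own
truly unique id plus a shared association field, then collapse at query time.
The field names (groupid, fulltext) are assumptions and would need matching
multiValued/string entries in the schema.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class GroupedFiles {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // one document per file, all pointing at the same logical entity
        for (int i = 0; i < 3; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "entity-1_file-" + i);   // truly unique key
            doc.addField("groupid", "entity-1");        // shared association field
            doc.addField("fulltext", "extracted text of file " + i);
            server.add(doc);
        }
        server.commit();

        // collapse the files back into one logical result per entity
        SolrQuery q = new SolrQuery("fulltext:extracted");
        q.set("group", true);
        q.set("group.field", "groupid");
        System.out.println(server.query(q).getGroupResponse().getValues());
    }
}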

On Thu, May 23, 2013 at 9:05 AM,  mark.ka...@t-systems.com wrote:
 Hello Erick,
 Thank you for your fast answer.

 Maybe I didn't express my question clearly.
 I want to index many files into one index entity, expecting the same behavior
 as any other multivalued field, which can hold several values under one unique
 id. So I think every ContentStreamUpdateRequest represents one index entity,
 doesn't it? And with each addContentStream I add one file to this entity.

 Thank you and with best Regards
 Mark

index multiple files into one index entity

2013-05-23 Thread Mark.Kappe
Hello solr team,

I want to index multiple files into one Solr index entity, with the same id.
We are using Solr 4.1.


I tried it with the following source fragment:

public void addContentSet(ContentSet contentSet) throws SearchProviderException {

    ...

    ContentStreamUpdateRequest csur =
            generateCSURequest(contentSet.getIndexId(), contentSet);
    String indexId = contentSet.getIndexId();

    ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
    server.request(csur);

    ...
}

private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
        throws IOException {
    ContentStreamUpdateRequest csur =
            new ContentStreamUpdateRequest(confStore.getExtractUrl());

    ModifiableSolrParams parameters = csur.getParams();
    if (parameters == null) {
        parameters = new ModifiableSolrParams();
    }

    parameters.set("literalsOverride", false);

    // maps the Tika default content attribute to the attribute named 'fulltext'
    parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
    // create an empty content stream; this seems necessary for ContentStreamUpdateRequest
    csur.addContentStream(new ImaContentStream());

    for (Content content : contentSet.getContentList()) {
        csur.addContentStream(new ImaContentStream(content));
        // for each content stream add additional attributes
        parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(),
                content.getBinaryObjectId().toString());
        parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(),
                content.getContentKey());
        parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(),
                content.getContentName());
        parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(),
                content.getMimeType());
    }

    parameters.set("literal.id", indexId);

    // adding some other attributes
    ...

    csur.setParams(parameters);

    return csur;
}

During debugging I can see that the method 'server.request(csur)' reads the
buffer of each ImaContentStream.
When I look at the Solr Catalina log, I see that the attached files reach the
Solr servlet.

INFO: Releasing directory:/data/V-4-1/master0/data/index
Apr 25, 2013 5:48:07 AM org.apache.solr.update.processor.LogUpdateProcessor 
finish
INFO: [master0] webapp=/solr-4-1 path=/update/extract 
params={literal.searchconnectortest15_c8150e41_cc49_4a .. 
literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1 .
{add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216), 
26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58


But only the last one in the content list actually gets indexed.


My schema.xml has the following field definitions:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="content" type="text_general" indexed="false" stored="true"
       multiValued="true"/>

<field name="contentkey" type="string" indexed="true" stored="true"
       multiValued="true"/>
<field name="contentid" type="string" indexed="true" stored="true"
       multiValued="true"/>
<field name="contentfilename" type="string" indexed="true" stored="true"
       multiValued="true"/>
<field name="contentmimetype" type="string" indexed="true" stored="true"
       multiValued="true"/>

<field name="fulltext" type="text_general" indexed="true" stored="true"
       multiValued="true"/>


I'm using the Tika ExtractingRequestHandler, which can extract content from binary files.



<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

Is it possible to index multiple files with the same id?
Is it necessary to implement my own RequestHandler?

With best regards Mark
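
Taking the replies in this thread together, one way the generateCSURequest
above could be restructured is one extract request per file: derive a truly
unique id from the entity id and the content id, and carry the entity id in a
separate field so the files can be regrouped at query time. This sketch reuses
the names from the fragment above (ImaContentStream, Content, confStore, and
SearchSystemAttributeDef are from that code); the groupid field is an
assumption, not something from the original schema.

private List<ContentStreamUpdateRequest> generatePerFileRequests(String indexId,
        ContentSet contentSet) throws IOException {
    List<ContentStreamUpdateRequest> requests =
            new ArrayList<ContentStreamUpdateRequest>();
    for (Content content : contentSet.getContentList()) {
        ContentStreamUpdateRequest csur =
                new ContentStreamUpdateRequest(confStore.getExtractUrl());
        csur.addContentStream(new ImaContentStream(content));

        ModifiableSolrParams parameters = new ModifiableSolrParams();
        parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
        // unique per file, so no stream overwrites another ...
        parameters.set("literal.id", indexId + "_" + content.getBinaryObjectId());
        // ... while the shared entity id stays queryable (e.g. via grouping)
        parameters.set("literal.groupid", indexId);
        parameters.set("literal." + SearchSystemAttributeDef.FILE_NAME.getName(),
                content.getContentName());
        csur.setParams(parameters);
        requests.add(csur);
    }
    return requests;
}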





Re: index multiple files into one index entity

2013-05-23 Thread Erick Erickson
I just skimmed your post, but I'm responding to the last bit.

If you have uniqueKey defined as id in schema.xml, then
no, you cannot have multiple documents with the same ID.
Whenever a new doc comes in, it replaces the old doc with that ID.

You can remove the uniqueKey definition and do what you want,
but there are very few Solr installations with no uniqueKey, and
it's probably a better idea to make your ids truly unique.

Best
Erick
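
For completeness: since Solr 4.0 there is a middle ground, the atomic updates
Erick mentions elsewhere in the thread. An update that wraps its values in an
"add" map appends to a multivalued field of the existing document instead of
replacing the whole document. A minimal SolrJ sketch, assuming every schema
field is stored and updateLog is enabled; the id and field names are
illustrative.

import java.util.Collections;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicAppend {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // append one more value to the multivalued 'fulltext' field of the
        // existing doc "entity-1"; the {"add": value} map marks this as an
        // atomic update rather than a full replacement
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "entity-1");
        doc.addField("fulltext", Collections.singletonMap("add", "text of the next file"));
        server.add(doc);
        server.commit();
    }
}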


AW: index multiple files into one index entity

2013-05-23 Thread Mark.Kappe
Hello Erick,
Thank you for your fast answer.

Maybe I didn't express my question clearly.
I want to index many files into one index entity, expecting the same behavior
as any other multivalued field, which can hold several values under one unique
id. So I think every ContentStreamUpdateRequest represents one index entity,
doesn't it? And with each addContentStream I add one file to this entity.

Thank you and with best Regards
Mark




-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, 23 May 2013 14:11
To: solr-user@lucene.apache.org
Subject: Re: index multiple files into one index entity

I just skimmed your post, but I'm responding to the last bit.

If you have uniqueKey defined as id in schema.xml, then no, you cannot have
multiple documents with the same ID.
Whenever a new doc comes in, it replaces the old doc with that ID.

You can remove the uniqueKey definition and do what you want, but there are
very few Solr installations with no uniqueKey, and it's probably a better idea
to make your ids truly unique.

Best
Erick
