SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
I have a slightly unusual use case. When I index a PDF into Solr, I use ContentStreamUpdateRequest. The resulting Lucene document then holds all of the parsed content of the PDF in its text field. I also need to add this parsed content to another Lucene document. Is there a way to obtain the parsed content in memory only, without committing it to Lucene?

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-ContentStreamUpdateRequest-Accessing-parsed-items-without-committing-to-solr-tp4032636.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
If I understand correctly, you are sending the file to Solr, which then uses the Tika library to do the preprocessing/extraction and stores the results in the defined fields. If you don't want Solr to do the storing and want to change the extracted fields, just use the Tika library in your client and work with the returned document yourself. This also reduces network load, as you don't send the whole file over the wire.

Regards, Alex.

On Fri, Jan 11, 2013 at 3:55 PM, uwe72 uwe.clem...@exxcellent.de wrote:
> Is there a way to obtain the parsed content in memory only, without committing it to Lucene?
Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
Yes, I don't really want to index/store the PDF document in Lucene. I just need the parsed tokens for other things. So you mean I can use ExtractingRequestHandler.java to retrieve the items? Does anybody have a piece of code that does that? Essentially, I pass in the PDF and want the parsed items back (the same content that would end up in the text field of the stored Lucene document).
Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
OK, this seems to work:

    import org.apache.tika.Tika;

    Tika tika = new Tika();
    String tokens = tika.parseToString(file);
Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
Look at the extractOnly parameter. But doing the extraction in your client is the recommended approach, since it keeps Solr from taking on the heavy parsing work.

Erik

On Jan 11, 2013, at 15:55, uwe72 uwe.clem...@exxcellent.de wrote:
> Is there a way to obtain the parsed content in memory only, without committing it to Lucene?
Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
Erik, what do you mean by this parameter? I can't find it.
Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr
It's an ExtractingRequestHandler parameter (see the wiki). I'm not quite sure of the Java incantation to set it, but it's definitely possible.

Erik

On Jan 11, 2013, at 17:14, uwe72 uwe.clem...@exxcellent.de wrote:
> Erik, what do you mean by this parameter? I can't find it.
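
For anyone landing on this thread later: the extractOnly route can be sketched roughly as follows. This is only a minimal illustration and not the SolrJ incantation Erik refers to; it posts the file over plain HTTP using the JDK's own client (Java 11+). It assumes the ExtractingRequestHandler is mapped to /update/extract, as in the stock example configuration, and that extractFormat=text is acceptable for your purposes. The class name, method names, and core URL are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper: asks Solr to extract text from a file without indexing it.
class ExtractOnlyExample {

    // Builds the request URI. Assumes the handler lives at /update/extract.
    // extractOnly=true tells the handler to return the extracted content
    // instead of indexing it; extractFormat=text asks for plain text
    // rather than the default XHTML-style output.
    static URI extractOnlyUri(String coreUrl) {
        String base = coreUrl.endsWith("/")
                ? coreUrl.substring(0, coreUrl.length() - 1)
                : coreUrl;
        return URI.create(base + "/update/extract?extractOnly=true&extractFormat=text");
    }

    // Posts the raw PDF bytes and returns Solr's response body as a string.
    // The body is Solr's normal response envelope (XML by default), which
    // wraps the extracted text and the document metadata.
    static String extract(String coreUrl, Path pdf) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractOnlyUri(coreUrl))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ExtractOnlyExample <coreUrl> <file.pdf>");
            return;
        }
        System.out.println(extract(args[0], Path.of(args[1])));
    }
}
```

Note that this still ships the whole file to Solr and makes Solr do the parsing, which is the load both Alex and Erik advise avoiding; running Tika in the client, as in the earlier message, sidesteps that.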