SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread uwe72
I have a slightly unusual use case.

When I index a PDF into Solr, I use ContentStreamUpdateRequest.
The resulting Lucene document then holds all the extracted content
(the parsed items of the physical PDF) in its text field.

I also need to add these parsed items to another Lucene document.

Is there a way to receive/parse these items in memory only, without
committing them to Lucene?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-ContentStreamUpdateRequest-Accessing-parsed-items-without-committing-to-solr-tp4032636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread Alexandre Rafalovitch
If I understand it correctly, you are sending the file to Solr, which then
uses the Tika library to do the preprocessing/extraction and stores the
results in the defined fields.

If you don't want Solr to do the storing and want to change the extracted
fields, just use the Tika library in your client and work with the returned
document yourself. This also reduces network load, as you don't send the
whole file over the wire.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread uwe72
Yes, I don't really want to index/store the PDF document in Lucene.

I just need the parsed tokens for other things.

So you mean I can use ExtractingRequestHandler.java to retrieve the items?

Does anybody have a piece of code that does this?

Basically, I give the PDF as input and want the parsed items back (the same
content that would be in the text field of the stored Lucene doc).







Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread uwe72
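OK, this seems to work:

  // Tika facade: org.apache.tika.Tika; file is a java.io.File
  Tika tika = new Tika();
  // Returns the extracted plain text; throws IOException or TikaException
  String tokens = tika.parseToString(file);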






Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread Erik Hatcher
Look at the extractOnly parameter.

But doing the extraction in your client is the more recommended approach,
since it keeps Solr from getting beaten up too badly.

Erik



Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread uwe72
Erik, what do you mean by this parameter? I can't find it.





Re: SolrJ |ContentStreamUpdateRequest | Accessing parsed items without committing to solr

2013-01-11 Thread Erik Hatcher
It's an ExtractingRequestHandler parameter (see the wiki). I'm not quite sure
of the Java incantation to set it, but it's definitely possible.
 
Erik
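
For reference, a minimal sketch of one way the extractOnly parameter might be
passed from Java, using only the JDK rather than SolrJ. The base URL
http://localhost:8983/solr, the handler path /update/extract, the class and
method names, and the application/pdf content type are assumptions made for
illustration; adjust them to your own setup. extractFormat=text asks the
handler for plain text instead of its default XHTML output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; not part of Solr or SolrJ.
class ExtractOnlyExample {

    // Builds the handler URL with extractOnly=true, so Solr returns the
    // extracted content in the response instead of indexing it.
    static String extractOnlyUrl(String solrBaseUrl, String handlerPath) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + handlerPath + "?extractOnly=true&extractFormat=text";
    }

    // POSTs the raw file bytes to the given URL and returns the response body.
    static String postFile(String url, Path file, String contentType) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Solr returned HTTP " + status);
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed local Solr instance and handler path.
        String url = extractOnlyUrl("http://localhost:8983/solr", "/update/extract");
        if (args.length == 0) {
            // No file given: only show the request URL, no network access.
            System.out.println("Would POST to: " + url);
            return;
        }
        System.out.println(postFile(url, Paths.get(args[0]), "application/pdf"));
    }
}
```

With SolrJ, the same parameter can presumably be set on the existing request
via setParam("extractOnly", "true") on the ContentStreamUpdateRequest before
sending it; the extracted content then comes back in the response rather than
being indexed. Either way the file still travels to Solr, which is why the
client-side Tika approach above remains the lighter option.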
