RE: How to index the parsed content effectively

2014-07-14 Thread Allison, Timothy B.
...@gmail.com] Sent: Friday, July 11, 2014 1:38 PM To: user@tika.apache.org Subject: Re: How to index the parsed content effectively Hi Tim, All. On 02/07/14 14:32, Allison, Timothy B. wrote: Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store

Re: How to index the parsed content effectively

2014-07-11 Thread Sergey Beryozkin
Hi Tim, All. On 02/07/14 14:32, Allison, Timothy B. wrote: Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader), which means you have to have the whole thing in memory.

Re: How to index the parsed content effectively

2014-07-02 Thread Ken Krugler
On Jul 2, 2014, at 5:27am, Sergey Beryozkin sberyoz...@gmail.com wrote: Hi All, We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field. This is unlikely to be

Re: How to index the parsed content effectively

2014-07-02 Thread Christian Reuschling
If you want to have a try, we created a crawling Tika parser, which gives recursive, incremental crawling capabilities to Tika. There we also implemented a handler as a decorator that writes into a Lucene index. Check out 'Create a Lucene index'

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin
Hi, On 02/07/14 13:54, Ken Krugler wrote: On Jul 2, 2014, at 5:27am, Sergey Beryozkin sberyoz...@gmail.com wrote: Hi All, We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from

RE: How to index the parsed content effectively

2014-07-02 Thread Allison, Timothy B.
Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader), which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a
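
The String-versus-Reader trade-off mentioned in this message can be illustrated with a minimal, dependency-free sketch. It deliberately does not use Lucene; the class and method names (ReaderVersusString, countCharsStreaming, readFully) are hypothetical and only stand in for the two ways of handing parsed text to an indexer. In Lucene terms, the streaming path roughly corresponds to a field built from a Reader (indexed but not stored), and the materialized path to a field built from a String, which is what storing the value requires.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration class; not part of Tika, Lucene, or Solr.
class ReaderVersusString {

    // Streaming path: consumes the content chunk by chunk, so only one
    // buffer of bufferSize characters is held in memory at any time.
    static long countCharsStreaming(Reader reader, int bufferSize) {
        char[] buffer = new char[bufferSize];
        long total = 0;
        int n;
        try {
            while ((n = reader.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Materialized path: copies everything into one String, so the whole
    // content has to fit in memory. This is what a stored field would need.
    static String readFully(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        try {
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String parsed = "text extracted by the parser";
        // Both paths see the same characters; they differ in peak memory use.
        System.out.println(countCharsStreaming(new StringReader(parsed), 8));
        System.out.println(readFully(new StringReader(parsed)).length());
    }
}
```

For small documents the difference is negligible; it only starts to matter when the parsed output is large enough that holding it as a single String becomes a memory concern.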

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin
Hi Tim, thanks for sharing your thoughts. I find them very helpful. On 02/07/14 14:32, Allison, Timothy B. wrote: Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader),

Re: How to index the parsed content effectively

2014-07-02 Thread Christian Reuschling
Another aspect is, if you index such large documents, you also receive these documents inside your search results, which is then again a bit ambiguous for a user (if there is one in the use case). The search problem is only partially solved in this

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin
Hi, On 02/07/14 17:32, Christian Reuschling wrote: Another aspect is, if you index such large documents, you also receive these documents inside your search results, which is then again a bit ambiguous for a user (if there is one in the use case).