Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
Martijn, have you seen the Highlighter in the Lucene Sandbox? If you've stored your text in the Lucene index, there is no need to go back to the DB to pull out the blog, parse it, and highlight it; the Highlighter in the Sandbox will do this for you. Otis --- M. Smit [EMAIL PROTECTED] wrote:

Re: retrieve tokens

2004-12-22 Thread M. Smit
Otis, The problem, though, is that I'm a little reluctant to store the data as Field.Text instead of Field.UnStored, because of the sheer size of the documents and the number of them I would like to index (say some 100paged * 2k documents). But then again, it's size versus

Re: retrieve tokens

2004-12-22 Thread Erik Hatcher
On Dec 22, 2004, at 12:04 PM, M. Smit wrote: The problem, though, is that I'm a little reluctant to store the data as Field.Text instead of Field.UnStored, because of the sheer size of the documents and the number of them I would like to index (say some 100paged * 2k documents). But then again, it's size

Re: retrieve tokens

2004-12-22 Thread M. Smit
Erik Hatcher wrote: Highlighter does not mandate you store your text in the index. It is just a convenient way to do it. You're free to pull the text from anywhere and highlight it based on the query. Furthermore, you are saying that the highlighter takes care of the corresponding

Re: retrieve tokens

2004-12-22 Thread Mike Snare
But for the other issue of 'store in Lucene' vs 'store in DB': can anyone provide me with some field experience on size? The system I'm developing will provide searching through some 2000 PDFs, say some 200 pages each. I feed the plain text into Lucene on a Field.UnStored basis. I also store

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis --- Mike Snare [EMAIL PROTECTED] wrote: But for the other issue on
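
To make the search-time KWIC (keyword in context) idea concrete, here is a minimal sketch in plain Java. It is not the Sandbox Highlighter and uses no Lucene classes: the class name KwicSnippet, the method snippet, and the context parameter are hypothetical names chosen for illustration. It assumes the document text has already been fetched (from the index or from a database) and that a single literal term is being highlighted, which is far simpler than what a real query needs.

```java
// Hypothetical helper, NOT part of Lucene: builds a keyword-in-context
// snippet around the first occurrence of a term in already-loaded text.
final class KwicSnippet {

    /**
     * Returns up to `context` characters on each side of the first
     * case-insensitive match of `term`, with the match wrapped in <b> tags.
     * Returns an empty string when there is no match or the input is unusable.
     * Note: the text is not HTML-escaped; that is left out for brevity.
     */
    static String snippet(String text, String term, int context) {
        if (text == null || term == null || term.isEmpty() || context < 0) {
            return "";
        }
        // Scan for the first case-insensitive match without copying the text.
        int hit = -1;
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }
        int end = hit + term.length();
        int from = Math.max(0, hit - context);
        int to = Math.min(text.length(), end + context);

        StringBuilder sb = new StringBuilder();
        if (from > 0) {
            sb.append("...");           // mark text cut off on the left
        }
        sb.append(text, from, hit);     // left context
        sb.append("<b>").append(text, hit, end).append("</b>");
        sb.append(text, end, to);       // right context
        if (to < text.length()) {
            sb.append("...");           // mark text cut off on the right
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(snippet("Lucene stores the text in the index", "text", 4));
    }
}
```

Because the snippet depends on the query term, it has to be computed per search, which is the point made above: it cannot be generated once and stored at index time.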

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
For simpy.com I store the full text of web pages in Lucene, in order to provide full-text web searches. Nutch (nutch.org) does the same. You can set the maximum number of tokens you want indexed via IndexWriter. You can also compress fields in the newest version of Lucene (or maybe just the one

Re: retrieve tokens

2004-12-22 Thread Erik Hatcher
On Dec 22, 2004, at 12:43 PM, M. Smit wrote: Erik Hatcher wrote: But for the other issue of 'store in Lucene' vs 'store in DB': can anyone provide me with some field experience on size? The system I'm developing will provide searching through some 2000 PDFs, say some 200 pages each. I feed the

Re: retrieve tokens

2004-12-22 Thread Martijn
Erik Hatcher wrote: On Dec 22, 2004, at 12:43 PM, M. Smit wrote: Consider that you're only highlighting 20 or so entries at one time. Getting the text from a Lucene index you're already navigating will be quite quick. But it shouldn't be too bad to pull 20 records from a database either.

Re: retrieve tokens

2004-12-22 Thread Martijn
Otis Gospodnetic wrote: I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis That is correct. I won't be having a lot of