RE: lucense index/document architecture

2007-01-27 Thread Joost Schouten
Erick, Otis, thank you for your help. I will work with a single index and parent fields. It's hard to say exactly how much raw data I will index, as this differs per client. But I guess right now I'm looking at about 1 GB (contents of a non-CLOB/BLOB DB). But one client is thinking of throwing their e

Re : lucene doc id's

2007-01-27 Thread saikrishna venkata pendyala
Hi, I was trying to store document IDs externally. I have found that Lucene generates document IDs linearly starting from 0, and that they do not change until a document is deleted. but it did work for me. How could I store document IDs externally?

Re: lucense index/document architecture

2007-01-27 Thread Otis Gospodnetic
100TB? Ouch. Yes, most certainly very different. Again, how to split the index and design the whole system depends on how this is going to be used, how it's going to be changed, if it's going to be changed, how it's going to grow, etc. I'd love to hear from you once you start working with 100

Re: Re : lucene document id's

2007-01-27 Thread Erick Erickson
I believe you are correct about when document IDs change. That said, I'd strongly recommend you spend some time trying to think of a way to keep from doing this, since it may lead to endless synchronization issues. But if you must, you can retrieve a document with IndexReader.document(id); On 1/27/
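The synchronization concern above can be illustrated with a toy model. This is only a sketch, not Lucene code: `DocIdShiftSketch` and all of its methods are hypothetical names, and it assumes that document numbers are simply positions that get reassigned once deleted entries are compacted away. It shows why an application-assigned key is steadier than an externally stored document number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Toy model only (hypothetical, not Lucene API): a document's number is its
// position in the list, and a null entry marks a deleted document.
class DocIdShiftSketch {
    private final List<String> keys = new ArrayList<>();

    /** Adds a document with an application-assigned key; returns its document number. */
    int add(String key) {
        keys.add(key);
        return keys.size() - 1;
    }

    /** Marks a document as deleted; remaining document numbers are unchanged. */
    void delete(int docNumber) {
        keys.set(docNumber, null);
    }

    /** Drops deleted entries, which renumbers every document after a gap. */
    void compact() {
        keys.removeIf(Objects::isNull);
    }

    /** Looks up the current document number for a key, or -1 if absent. */
    int docNumberOf(String key) {
        return keys.indexOf(key);
    }

    public static void main(String[] args) {
        DocIdShiftSketch index = new DocIdShiftSketch();
        index.add("a");
        index.add("b");
        index.add("c");
        index.delete(1);
        System.out.println("before compaction: " + index.docNumberOf("c"));
        index.compact();
        System.out.println("after compaction: " + index.docNumberOf("c"));
    }
}
```

In this model a number saved externally for "c" goes stale after compaction, while a lookup by key still finds the document.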

Re: lucense index/document architecture

2007-01-27 Thread Erick Erickson
I put in 1TB as a number because I thought it would surely be bigger than anything you intended to put in your database. And you reply with 100 times that size. The index I'm working with now is 5GB, so I have no wisdom to offer you at all about how to scale to 100TB. You should probably inf

My program stops indexing after 10000th documents is indexed

2007-01-27 Thread maureen tanuwidjaja
Hi all, Is there any limit on the number of files that Lucene can handle? I indexed a total of 3 XML Documents, however it stops at 1th documents. No warning, no error, no exception either. Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491876.xml Indexing C:\sweetp

Re: Multiword Highlighting

2007-01-27 Thread Mark Miller
Isn't it semi trivial if you are not interested in the fragments (I swear it seems that most people are not)? Isn't it you that suggested turning the query into a SpanQuery, extracting the spans and then doing the highlighting after a rewrite? This seems somewhat trivial so what am I missing? I

Re: My program stops indexing after 10000th documents is indexed

2007-01-27 Thread Chris Hostetter
Did you try triggering a thread dump to see what it was doing at that point? Depending on your merge factors and other IndexWriter settings it could just be doing a really big merge. : Date: Sat, 27 Jan 2007 09:40:47 -0800 (PST) : From: maureen tanuwidjaja <[EMAIL PROTECTED]> : Reply-To: java-us
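A thread dump is usually triggered from outside the process (Ctrl+Break in a Windows console, or `kill -QUIT <pid>` on Unix). As a minimal sketch of the same idea from inside the JVM, the standard `Thread.getAllStackTraces()` call (Java 5 and later) can be used. The class and method names below are hypothetical and are not part of Lucene.

```java
import java.util.Map;

// Sketch only: ThreadDumpSketch and dumpAllThreads are hypothetical names.
class ThreadDumpSketch {

    /** Builds a text dump of the stack of every live thread in this JVM. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

If the indexing thread shows frames inside merge-related code, that would be consistent with the big-merge explanation; the dump only tells you where the thread is, not why.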

IndexWriter.docCount

2007-01-27 Thread karl wettin
/** Returns the number of documents currently in this index. */ public synchronized int docCount() { int count = ramSegmentInfos.size(); for (int i = 0; i < segmentInfos.size(); i++) { SegmentInfo si = segmentInfos.info(i); count += si.docCount; } return count; } I

RE: lucense index/document architecture

2007-01-27 Thread Joost Schouten
I'll keep you posted ;-) Joost Schouten Director JS Portal Dasstraat 21 2623CB Delft the Netherlands P: +31 6 160 160 14 W: www.jsportal.com -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Sunday, January 28, 2007 4:34 AM To: java-user@lucene.apache.org Subje

Re: IndexWriter.docCount

2007-01-27 Thread Doron Cohen
Hi Karl, karl wettin <[EMAIL PROTECTED]> wrote on 27/01/2007 11:54:18: > /** Returns the number of documents currently in this index. */ >public synchronized int docCount() { > int count = ramSegmentInfos.size(); > for (int i = 0; i < segmentInfos.size(); i++) { >SegmentInfo

Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-27 Thread no spam
Isn't this extremely inefficient to do the Euclidean distance twice? Perhaps not a huge deal if a small search result set. I at times have 13,000 results that match my search terms of an index with 1.2 million docs. Can't you do some simple radian math first to ensure it's way out of bounds, the
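The "cheap check first" idea raised here can be sketched as a bounding-box rejection ahead of the exact distance test. This is an illustration under stated assumptions, not the code discussed in the thread: it uses planar x/y coordinates rather than latitude/longitude, and `DistancePrefilterSketch` and `withinRadius` are hypothetical names.

```java
// Sketch only: planar coordinates and all names here are assumptions.
class DistancePrefilterSketch {

    /**
     * Returns true if point (x, y) lies within the given radius of (cx, cy).
     * A point outside the square that encloses the circle is rejected with
     * two comparisons, before any multiplication is done.
     */
    static boolean withinRadius(double x, double y, double cx, double cy, double radius) {
        double dx = Math.abs(x - cx);
        double dy = Math.abs(y - cy);
        if (dx > radius || dy > radius) {
            return false; // cheap reject: outside the bounding box
        }
        // Exact test on squared distance, which avoids a square root.
        return dx * dx + dy * dy <= radius * radius;
    }

    public static void main(String[] args) {
        System.out.println(withinRadius(3, 4, 0, 0, 5)); // on the circle
        System.out.println(withinRadius(6, 0, 0, 0, 5)); // outside the box
        System.out.println(withinRadius(4, 4, 0, 0, 5)); // inside the box, outside the circle
    }
}
```

Whether the pre-check pays off depends on how many candidates fall outside the box; with thousands of hits spread over a wide area it skips most of the exact computations.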

Re: IndexWriter.docCount

2007-01-27 Thread karl wettin
27 jan 2007 kl. 21.19 skrev Doron Cohen: karl wettin <[EMAIL PROTECTED]> wrote on 27/01/2007 11:54:18: /** Returns the number of documents currently in this index. */ public synchronized int docCount() { I don't understand, what is it this method returns? "Something else" - it is the

Re: Multiword Highlighting

2007-01-27 Thread markharw00d
>>Isn't it semi trivial if you are not interested in the fragments (I swear it seems that most people are not)? I haven't conducted a survey but it's the typical web search engine scenario - select only a small subset of the matching document content for display in SERPs. I would expect that

Re: Multiword Highlighting

2007-01-27 Thread Mark Miller
markharw00d wrote: >>Isn't it semi trivial if you are not interested in the fragments (I swear it seems that most people are not)? I haven't conducted a survey but it's the typical web search engine scenario - select only a small subset of the matching document content for display in SERP

Re: Multiword Highlighting

2007-01-27 Thread Mark Miller
Maybe a new highlighter with no attempt at summarising could more easily address phrase support for small pieces of content. It will always be hard to faithfully represent all possible query match logic - especially if there are NOTs, ANDs and ORs mixed in with all the term proximity logic

Re: Re : lucene document id's

2007-01-27 Thread Kay Roepke
Hi! I promised karl that I'd share something on this topic, so here it goes. It fits the subject, too ;) On Jan 27, 2007, at 6:14 PM, Erick Erickson wrote: I believe you are correct about when document IDs change. That said, I'd strongly recommend you spend some time trying to think of a way

Re: IndexWriter.docCount

2007-01-27 Thread Doron Cohen
karl wettin <[EMAIL PROTECTED]> wrote on 27/01/2007 13:49:24: > Deleted as in still available in the segment and noted in the deleted > file, but not optimized and IllegalArgumentException thrown in case > of IndexReader.document(n)? At least I think that is the way a > Directory works? Yes.. so i
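The distinction being discussed, a per-segment sum that still counts documents marked deleted but not yet merged away, can be sketched with a toy model. This is not Lucene's implementation: `ToySegment` and `DocCountSketch` are hypothetical, and the model assumes each segment keeps a fixed document count plus a separate set of deletion marks.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model only (hypothetical, not Lucene API): a segment has a fixed
// document count and records deletions separately as marks.
class ToySegment {
    final int docCount;
    final BitSet deleted = new BitSet();

    ToySegment(int docCount) {
        this.docCount = docCount;
    }

    /** Marks one document as deleted without shrinking the segment. */
    void delete(int docNumber) {
        deleted.set(docNumber);
    }
}

class DocCountSketch {

    /** Sums per-segment counts, so documents marked deleted are still included. */
    static int docCount(List<ToySegment> segments) {
        int count = 0;
        for (ToySegment segment : segments) {
            count += segment.docCount;
        }
        return count;
    }

    /** Counts only documents that carry no deletion mark. */
    static int liveDocCount(List<ToySegment> segments) {
        int count = 0;
        for (ToySegment segment : segments) {
            count += segment.docCount - segment.deleted.cardinality();
        }
        return count;
    }

    public static void main(String[] args) {
        ToySegment first = new ToySegment(3);
        ToySegment second = new ToySegment(2);
        first.delete(1);
        List<ToySegment> segments = Arrays.asList(first, second);
        System.out.println("docCount = " + docCount(segments));
        System.out.println("liveDocCount = " + liveDocCount(segments));
    }
}
```

In this model the two numbers only agree again once the deleted entries are physically removed, which is the role optimization plays in the thread.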

Re: Multiword Highlighting

2007-01-27 Thread Otis Gospodnetic
For what it's worth Mark (Miller), there *is* a need for "just highlight the query terms without trying to get excerpts" functionality - something a la Google cache (different colours...mmm, nice). I've had people ask me for this before, and I know I could use this functionality, too. Please c