Re: DocumentWriter, StopFilter should use HashMap... (patch)
On Mar 9, 2004, at 10:23 PM, Kevin A. Burton wrote: You need to make it a HashSet: table = new HashSet( stopTable.keySet() ); Done. Also... while you're at it... the private variable name is 'table', which this HashSet certainly is *not* ;) Well, depends on your definition of 'table' I suppose :) I changed it to a type-agnostic stopWords. Probably makes sense to just call this variable 'hashset' and then force the type to be HashSet, since it's necessary for this to be a HashSet to maintain any decent performance. You'll need to update your second constructor to require a HashSet too... it would be very bad to let callers use another Set impl... TreeSet and SortedSet would still be too slow... I refuse to expose HashSet... sorry! :) But I did wrap what is passed in, like above, in a HashSet in my latest commit. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
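A minimal sketch of the pattern under discussion (class and method names here are illustrative, not Lucene's actual API): a stop filter that accepts any Set but defensively copies it into a HashSet, so lookups stay constant-time even if the caller hands in a TreeSet.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of wrapping whatever Set the caller passes in a
// HashSet, as described in Erik's commit. Not the real StopFilter.
class SimpleStopFilter {
    private final HashSet<String> stopWords;

    SimpleStopFilter(Set<String> words) {
        // Defensive copy into a hash-based set so contains() is O(1).
        this.stopWords = new HashSet<>(words);
    }

    boolean isStopWord(String token) {
        return stopWords.contains(token);
    }
}
```

The trade-off debated in this thread is exactly this copy: it protects the internal representation but costs an allocation per filter construction.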
Re: Storing numbers
On Tuesday 09 March 2004 20:51, Timothy Stone wrote: Michael Giles wrote: Tim, Looks like you can only access it with a subscription. :( Sounds good, though. Really? I don't have a subscription. Got to it via the archives, actually, now that I think about it: Try Volume 7, Issue 12. I also need a subscription for: http://www.sys-con.com/story/search.cfm?pub=1ss=lucene
Large document collections?
I'm looking for information on the largest document collection that Lucene has been used to index; the biggest benchmark I've been able to find so far is 1MM documents. I'd like to generate some benchmarks for large collections (1-100MM records) and would like to know whether this is feasible without using distributed indexes, etc. It's mostly to construct a performance profile relating indexing/retrieval time and storage requirements to the number of documents. Thanks.
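A hypothetical harness for the kind of profile described: time the indexing step at each collection size and plot elapsed time against document count. The IntConsumer stands in for whatever indexing code is actually under test; nothing here is from Lucene itself.

```java
import java.util.function.IntConsumer;

// Timing harness sketch: run an indexing action for a given document
// count and return elapsed milliseconds, so time can be plotted
// against collection size. The indexer argument is a placeholder.
class IndexBenchmark {
    static long timeMillis(IntConsumer indexer, int docCount) {
        long start = System.nanoTime();
        indexer.accept(docCount);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Repeating this at 100K, 1MM, 10MM, etc., and recording index directory size after each run, would produce the time/storage curves the poster wants.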
Re: Large document collections?
I think even a 100K or 1MM doc collection will give you an idea of the retrieval time/storage requirements (which, of course, are highly dependent on what you index and how you index it). I know several people have created collections with up to 50MM docs on a single machine (not sure about the number of CPUs, etc.). Otis --- Mark Devaney [EMAIL PROTECTED] wrote: I'm looking for information on the largest document collection that Lucene has been used to index; the biggest benchmark I've been able to find so far is 1MM documents. I'd like to generate some benchmarks for large collections (1-100MM records) and would like to know whether this is feasible without using distributed indexes, etc. It's mostly to construct a performance profile relating indexing/retrieval time and storage requirements to the number of documents. Thanks.
RE: Storing numbers
Try this link and scroll to the top: http://www.sys-con.com/story/?storyid=37296DE=1#RES Thank you, Tim - excellent article. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 10, 2004 10:23 AM To: Lucene Users List Subject: Re: Storing numbers On Tuesday 09 March 2004 20:51, Timothy Stone wrote: Michael Giles wrote: Tim, Looks like you can only access it with a subscription. :( Sounds good, though. Really? I don't have a subscription. Got to it via the archives, actually, now that I think about it: Try Volume 7, Issue 12. I also need a subscription for: http://www.sys-con.com/story/search.cfm?pub=1ss=lucene
Re: Large document collections?
I use several collections: one of 1,200,000 documents, one of 3,800,000, and another of 12,000,000 documents (for the biggest), and performance is quite good (except for searches with wildcards). Our machine has 1 gigabyte of memory and 2 CPUs. - Original Message - From: Mark Devaney [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, March 10, 2004 4:26 PM Subject: Large document collections? I'm looking for information on the largest document collection that Lucene has been used to index; the biggest benchmark I've been able to find so far is 1MM documents. I'd like to generate some benchmarks for large collections (1-100MM records) and would like to know whether this is feasible without using distributed indexes, etc. It's mostly to construct a performance profile relating indexing/retrieval time and storage requirements to the number of documents. Thanks.
Re: Large document collections?
Well, usually the response times are 5-10 seconds max, depending on the query (except for queries with a wildcard). I set a timeout of 30 seconds for all queries. Queries with a wildcard can fail with a java.lang.OutOfMemoryError. You can try it yourself on my company's website (though the site seems down for the moment); if you want the address, send me a mail off the list, please, and I'll explain in detail how to run your own tests and how our software works. - Original Message - From: Albert Vila Puig [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, March 10, 2004 5:36 PM Subject: Re: Large document collections? Can you please provide some queries and their performance? Thanks Paladin wrote: I use several collections: one of 1,200,000 documents, one of 3,800,000, and another of 12,000,000 documents (for the biggest), and performance is quite good (except for searches with wildcards). Our machine has 1 gigabyte of memory and 2 CPUs.
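The 30-second query timeout the poster describes could be implemented along these lines in plain Java (the Callable stands in for the actual Lucene search call; this is a sketch of the general technique, not the poster's code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Run a search on a worker thread and give up after a time limit.
// Future.get throws TimeoutException if the search overruns.
class TimedQuery {
    static <T> T runWithTimeout(Callable<T> search, long seconds) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> result = pool.submit(search);
            return result.get(seconds, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow(); // best-effort cancel of a runaway query
        }
    }
}
```

Note this only bounds how long the caller waits; a runaway wildcard expansion can still consume memory on the worker thread until it is interrupted.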
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote: Also... while you're at it... the private variable name is 'table', which this HashSet certainly is *not* ;) Well, depends on your definition of 'table' I suppose :) I changed it to a type-agnostic stopWords. Did you know that internally HashSet uses a HashMap? I sure didn't! hashset.contains() maps to hashmap.containsKey(). It uses a key-value mapping to a generic PRESENT Object... hm. Probably makes sense to just call this variable 'hashset' and then force the type to be HashSet, since it's necessary for this to be a HashSet to maintain any decent performance. You'll need to update your second constructor to require a HashSet too... it would be very bad to let callers use another Set impl... TreeSet and SortedSet would still be too slow... I refuse to expose HashSet... sorry! :) But I did wrap what is passed in, like above, in a HashSet in my latest commit. Hm... You're doing this EVEN if the caller passes a HashSet directly?! Why do you have a problem exposing a HashSet/Map... it SHOULD be a hash-based implementation. Doing anything else is just wrong and would seriously slow down Lucene indexing. Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this can just be removed by forcing the caller to use a HashSet (which they should). :) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
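The internal detail Kevin mentions (that java.util.HashSet is built on a HashMap, with every element mapped to one shared dummy PRESENT value) can be illustrated with a toy class that mimics the same structure:

```java
import java.util.HashMap;

// Toy set-on-a-map, mirroring how java.util.HashSet is implemented:
// every element is a map key pointing at one shared PRESENT object,
// so set.contains() is just map.containsKey().
class MapBackedSet {
    private static final Object PRESENT = new Object();
    private final HashMap<String, Object> map = new HashMap<>();

    void add(String s) {
        map.put(s, PRESENT);
    }

    boolean contains(String s) {
        return map.containsKey(s); // same lookup cost as HashMap
    }
}
```

This is why the thread's performance argument treats HashSet and HashMap lookups as equivalent: they are the same hash-table probe underneath.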
Re: DocumentWriter, StopFilter should use HashMap... (patch)
On Mar 10, 2004, at 2:59 PM, Kevin A. Burton wrote: I refuse to expose HashSet... sorry! :) But I did wrap what is passed in, like above, in a HashSet in my latest commit. Hm... You're doing this EVEN if the caller passes a HashSet directly?! Well, it was in the ctor. But I guess I'm not seeing all the places the filter is being constructed that would make this cause a performance hit. Why do you have a problem exposing a HashSet/Map... it SHOULD be a hash-based implementation. Doing anything else is just wrong and would seriously slow down Lucene indexing. Just semantically, it is a set of stop words - so in theory the actual implementation shouldn't matter. I'm an interface purist at heart. Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this can just be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Erik
1.3-final builds as 1.4-rc1-dev?
Hello, I noticed that Lucene 1.3-final source builds a JAR file whose version number is 1.4-rc1-dev. What does this mean? Will 1.4-final build as 1.5-rc1-dev? Just curious, Jeff
Re: 1.3-final builds as 1.4-rc1-dev?
My guess is that we screwed up the timing somehow and changed the build file version after we built the binary release. We'll be more careful with the 1.4 release and make sure this doesn't happen then. Erik On Mar 10, 2004, at 8:34 PM, Jeff Wong wrote: Hello, I noticed that Lucene 1.3-final source builds a JAR file whose version number is 1.4-rc1-dev. What does this mean? Will 1.4-final build as 1.5-rc1-dev? Just curious, Jeff
Re: 1.3-final builds as 1.4-rc1-dev?
Jeff Wong wrote: I noticed that Lucene 1.3-final source builds a JAR file whose version number is 1.4-rc1-dev. What does this mean? Will 1.4-final build as 1.5-rc1-dev? Probably. If you modify the sources of a 1.3-final release, and build them, you're not building 1.3-final, but a derivative. We could call it 1.3-dev or something, but that would be strange, as 1.3 development is closed. All development is now towards 1.4-based releases. As a side-effect, even if you make no changes to the 1.3-final sources and build them, it builds as 1.4-rc1-dev. I think that is still safer than calling it 1.3-final, since 1.3-final should be reserved for the exact jar file downloaded from Apache. In general, anything ending with -dev doesn't have any guarantees, and the version before that is only meant to be suggestive. Doug
Re: 1.3-final builds as 1.4-rc1-dev?
On Mar 10, 2004, at 9:45 PM, Doug Cutting wrote: Jeff Wong wrote: I noticed that Lucene 1.3-final source builds a JAR file whose version number is 1.4-rc1-dev. What does this mean? Will 1.4-final build as 1.5-rc1-dev? Probably. If you modify the sources of a 1.3-final release, and build them, you're not building 1.3-final, but a derivative. We could call it 1.3-dev or something, but that would be strange, as 1.3 development is closed. All development is now towards 1.4-based releases. As a side-effect, even if you make no changes to the 1.3-final sources and build them, it builds as 1.4-rc1-dev. I think that is still safer than calling it 1.3-final, since 1.3-final should be reserved for the exact jar file downloaded from Apache. In general, anything ending with -dev doesn't have any guarantees, and the version before that is only meant to be suggestive. Ah... this seems perfectly reasonable! And I concur: if it's not the exact JAR then it shouldn't have the final stamp of approval. Erik
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote: Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this can just be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Did you not see my message suggesting a way to both not expose HashSet publicly and also not copy values? If not, I attached it. Doug ---BeginMessage--- [EMAIL PROTECTED] wrote: - public StopFilter(TokenStream in, Set stopTable) { + public StopFilter(TokenStream in, Set stopWords) { super(in); -table = stopTable; +this.stopWords = new HashSet(stopWords); } This always allocates a new HashSet, which, if the stop list is large and documents are small, could impact performance. Perhaps we can replace this with something like: public StopFilter(TokenStream in, Set stopWords) { this(in, stopWords instanceof HashSet ? ((HashSet)stopWords) : new HashSet(stopWords)); } and then add another constructor: private StopFilter(TokenStream in, HashSet stopWords) { super(in); this.stopWords = stopWords; } Also, if we want the implementation to always be a HashSet internally, for performance, we ought to declare the field to be a HashSet, no? The competing goals here are: 1. Not to expose publicly the implementation of the Set; 2. Not to copy the contents of the Set when folks pass the value of makeStopSet; 3. Use the most efficient implementation internally. I think the changes above meet all of these. Doug ---End Message---
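Doug's two-constructor trick can be restated as a self-contained sketch (class and field names here are illustrative; the real code lives in StopFilter): the public constructor stays typed to the Set interface, but it avoids copying when the caller already passes a HashSet by delegating to a private HashSet-typed constructor.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the pattern: public API takes any Set, internal field is a
// concrete HashSet for fast lookups, and a copy is made only when the
// caller's Set is not already a HashSet.
class StopSet {
    private final HashSet<String> stopWords; // concrete type internally, for speed

    public StopSet(Set<String> words) {
        this(words instanceof HashSet
                ? (HashSet<String>) words
                : new HashSet<>(words));
    }

    private StopSet(HashSet<String> words) {
        this.stopWords = words; // no copy on this path
    }

    boolean contains(String word) {
        return stopWords.contains(word);
    }
}
```

This satisfies all three of Doug's goals at once: the implementation type is not exposed publicly, no copy happens for HashSet callers, and lookups use the most efficient structure internally.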
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote: Erik Hatcher wrote: Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this can just be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Did you not see my message suggesting a way to both not expose HashSet publicly and also not copy values? If not, I attached it. For the record, I didn't see it... but it echoes my points... Thanks! Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote: Erik Hatcher wrote: Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this can just be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Did you not see my message suggesting a way to both not expose HashSet publicly and also not copy values? If not, I attached it. Yes, I saw it. But is there a reason not to just expose HashSet, given that it is the data structure that is most efficient? I bought into Kevin's arguments that it made sense to just expose HashSet. As for copying values - that is only happening now if you use the Hashtable or String[] constructor. Erik From: Doug Cutting [EMAIL PROTECTED] Date: March 10, 2004 1:08:24 PM EST To: Lucene Developers List [EMAIL PROTECTED] Subject: Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java Reply-To: Lucene Developers List [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: - public StopFilter(TokenStream in, Set stopTable) { + public StopFilter(TokenStream in, Set stopWords) { super(in); -table = stopTable; +this.stopWords = new HashSet(stopWords); } This always allocates a new HashSet, which, if the stop list is large and documents are small, could impact performance. Perhaps we can replace this with something like: public StopFilter(TokenStream in, Set stopWords) { this(in, stopWords instanceof HashSet ? ((HashSet)stopWords) : new HashSet(stopWords)); } and then add another constructor: private StopFilter(TokenStream in, HashSet stopWords) { super(in); this.stopWords = stopWords; } Also, if we want the implementation to always be a HashSet internally, for performance, we ought to declare the field to be a HashSet, no? The competing goals here are: 1. Not to expose publicly the implementation of the Set; 2. Not to copy the contents of the Set when folks pass the value of makeStopSet; 3. Use the most efficient implementation internally. I think the changes above meet all of these. Doug
incomplete word match
I have a situation where I need to be able to find incomplete word matches; for example, a search for the string 'ape' would return matches for 'grapes', 'naples', 'staples', etc. I have been searching the archives of this user list and can't seem to find any example of someone doing this. At one point I recall finding someone's site (on Google) which indicated that their search engine was Lucene, and they offered the capability of doing this type of matching. However, I can't seem to find that site again to save my life! Has anyone been successful in implementing this type of matching with Lucene? If so, would you be able to share some insight as to how you did it? Thanks in advance! -TP
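No answer appears in this chunk of the archive, but one common technique for substring matching is to index character n-grams of each word and intersect the postings for the query's n-grams. A minimal plain-Java sketch of the idea using trigrams (illustrative only; in Lucene this would normally be done with a custom analyzer rather than a hand-rolled map):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Index each word under all of its character trigrams; a query matches
// a word when the word's trigram set covers every trigram of the query.
class TrigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    void add(String word) {
        for (String gram : trigrams(word)) {
            index.computeIfAbsent(gram, k -> new HashSet<>()).add(word);
        }
    }

    Set<String> search(String query) {
        Set<String> result = null;
        for (String gram : trigrams(query)) {
            Set<String> hits = index.getOrDefault(gram, Collections.emptySet());
            if (result == null) result = new HashSet<>(hits);
            else result.retainAll(hits); // intersect postings
        }
        return result == null ? Collections.emptySet() : result;
    }

    private static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) grams.add(s.substring(i, i + 3));
        return grams;
    }
}
```

Queries shorter than three characters produce no trigrams and match nothing in this sketch; a production version would fall back to shorter grams or a scan for those.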