Re: DocumentWriter, StopFilter should use HashMap... (patch)
Just found the rest of the thread. I'll shut up now ;)

sv

On Sun, 14 Mar 2004, Stephane James Vaucher wrote:
> Back from a week's vacation, so this reply is a little late and maybe
> out of order as well ;). Comment inline:
>
> On Tue, 9 Mar 2004, Kevin A. Burton wrote:
>> Doug Cutting wrote:
>>> Erik Hatcher wrote:
>>>> Well, one issue you didn't consider is changing a public method
>>>> signature. I will make this change, but leave the Hashtable
>>>> signature method there. I suppose we could change the signature
>>>> to use a Map instead, but I believe there are some issues with
>>>> doing something like this if you do not recompile your own source
>>>> code against a new Lucene JAR, so I will simply provide another
>>>> signature too.
>>>
>>> This would no longer compile with the change Kevin proposes.
>>>
>>> To make things back-compatible we must:
>>> 1. Keep but deprecate the StopFilter(Hashtable) constructor;
>>> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
>>> 3. Add a new constructor: StopFilter(HashMap);
>>> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Why impose implementation details in the constructor? Shouldn't the
> constructor use a Map (not a HashMap), a Set, or a String array?
>
> sv
>
>>> Does that make sense?
>>
>> This patch and attachment take care of this problem...
>>
>> It does make this class more complex than it needs to be... but 1/2
>> of the methods are deprecated.
>>
>> Kevin

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentWriter, StopFilter should use HashMap... (patch)
I don't buy it. HashSet is but one implementation of a Set. By choosing
the HashSet implementation you are not only tying the class to a
hash-based implementation, you are tying the interface to *that
specific* hash-based implementation or its subclasses. In the end,
either you buy the concept of the interface and its abstraction or you
don't. I firmly believe in using interfaces as they were intended to be
used.

Scott

P.S. In fact, HashSet isn't always going to be the most efficient
anyway. Just for one example: consider possible implementations if I
have only 1 or 2 entries.

On Mar 10, 2004, at 11:13 PM, Erik Hatcher wrote:
> On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
>> Erik Hatcher wrote:
>>>> Also... your HashSet constructor has to copy values from the
>>>> original HashSet into the new HashSet ... not very clean and this
>>>> can just be removed by forcing the caller to use a HashSet (which
>>>> they should).
>>>
>>> I've caved in and gone HashSet all the way.
>>
>> Did you not see my message suggesting a way to both not expose
>> HashSet publicly and also not to copy values? If not, I attached it.
>
> Yes, I saw it. But is there a reason not to just expose HashSet given
> that it is the data structure that is most efficient? I bought into
> Kevin's arguments that it made sense to just expose HashSet.
>
> As for copying values - that is only happening now if you use the
> Hashtable or String[] constructor.
>
> Erik
>
>> Doug
>>
>> From: Doug Cutting [EMAIL PROTECTED]
>> Date: March 10, 2004 1:08:24 PM EST
>> To: Lucene Developers List [EMAIL PROTECTED]
>> Subject: Re: cvs commit:
>>   jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
>> Reply-To: Lucene Developers List [EMAIL PROTECTED]
>>
>> [EMAIL PROTECTED] wrote:
>>> -  public StopFilter(TokenStream in, Set stopTable) {
>>> +  public StopFilter(TokenStream in, Set stopWords) {
>>>      super(in);
>>> -    table = stopTable;
>>> +    this.stopWords = new HashSet(stopWords);
>>>    }
>>
>> This always allocates a new HashSet, which, if the stop list is
>> large, and documents are small, could impact performance.
>>
>> Perhaps we can replace this with something like:
>>
>>   public StopFilter(TokenStream in, Set stopWords) {
>>     this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>>          : new HashSet(stopWords));
>>   }
>>
>> and then add another constructor:
>>
>>   private StopFilter(TokenStream in, HashSet stopWords) {
>>     super(in);
>>     this.stopWords = stopWords;
>>   }
>>
>> Also, if we want the implementation to always be a HashSet
>> internally, for performance, we ought to declare the field to be a
>> HashSet, no?
>>
>> The competing goals here are:
>> 1. Not to expose publicly the implementation of the Set;
>> 2. Not to copy the contents of the Set when folks pass the value of
>>    makeStopSet.
>> 3. Use the most efficient implementation internally.
>>
>> I think the changes above meet all of these.
>>
>> Doug
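[Editor's sketch] The constructor delegation Doug proposes in the quoted message can be tried out in isolation. What follows is a minimal illustration under stated assumptions, not Lucene's StopFilter: the class name StopWordSetHolder and the methods isStopWord and sharesStorageWith are hypothetical, the TokenStream argument is left out so the example compiles on its own, and generics are used although the code in the thread uses raw types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the filter's stop-word handling; not Lucene's StopFilter.
final class StopWordSetHolder {
    // Declared as HashSet so lookups are always hash-based internally.
    private final HashSet<String> stopWords;

    // The public constructor accepts any Set, so HashSet never appears in the API.
    public StopWordSetHolder(Set<String> stopWords) {
        this(stopWords instanceof HashSet
                ? (HashSet<String>) stopWords        // already a HashSet: reuse it, no copy
                : new HashSet<String>(stopWords));   // anything else: copy once
    }

    // The private constructor is the only place the field is assigned.
    private StopWordSetHolder(HashSet<String> stopWords) {
        this.stopWords = stopWords;
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }

    // Hook for this sketch only: true if the caller's set object was reused as-is.
    boolean sharesStorageWith(Set<String> other) {
        return stopWords == other;
    }

    public static void main(String[] args) {
        Set<String> hash = new HashSet<String>(Arrays.asList("a", "the"));
        Set<String> tree = new TreeSet<String>(Arrays.asList("a", "the"));
        System.out.println(new StopWordSetHolder(hash).sharesStorageWith(hash)); // true
        System.out.println(new StopWordSetHolder(tree).sharesStorageWith(tree)); // false
    }
}
```

Because the argument expression has static type HashSet, overload resolution picks the private constructor, so the public one does not recurse. A caller holding a HashSet pays no copy; a caller holding a TreeSet pays one copy at construction time.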
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
> Yes, I saw it. But is there a reason not to just expose HashSet given
> that it is the data structure that is most efficient? I bought into
> Kevin's arguments that it made sense to just expose HashSet.

Just the general principle that one shouldn't expose more of the
implementation than one must. I can imagine faster things than a
HashSet for this, e.g., a well-coded letter tree (trie) could be a bit
faster, since it would only touch each character in the key once. But
it's not a big deal, perhaps not worth fixing at this point.

I proposed a solution that both respected this concern (yours, as I
recall) while at the same time avoiding copying. It doesn't need to be
an either/or situation. We can easily hide the implementation, avoid
copying, and use the most efficient implementation internally. If you
no longer care about hiding the implementation, then I guess this is
moot.

Before we started this exercise the implementation was exposed, so
things have gotten no worse, only better. But they could have gotten
just a little bit better yet!

Cheers,

Doug
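[Editor's sketch] To make the "letter tree" remark concrete, here is a small trie with a membership test that walks the key one character at a time. It is an illustration only: the class name StopWordTrie is hypothetical, nothing like it is proposed as a patch in this thread, and the per-node HashMap is chosen for readability, so this version is unlikely to outperform a HashSet. A tuned trie would use arrays or a similar compact child table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a "letter tree"; not part of Lucene.
final class StopWordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        boolean terminal; // true if a stop word ends at this node
    }

    private final Node root = new Node();

    StopWordTrie(String... words) {
        for (String word : words) {
            add(word);
        }
    }

    private void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            // Create the child for this character on first use.
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // Walks the trie once, touching each character of the key a single time.
    boolean contains(CharSequence term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) {
                return false; // no stop word has this prefix
            }
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        StopWordTrie stops = new StopWordTrie("a", "an", "the");
        System.out.println(stops.contains("the"));  // true
        System.out.println(stops.contains("th"));   // false: a prefix, not a word
        System.out.println(stops.contains("them")); // false
    }
}
```

Unlike a hash lookup, a miss can return as soon as the first unmatched character is seen, without reading the rest of the key.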
Re: DocumentWriter, StopFilter should use HashMap... (patch)
I will refactor again using Set with no copying this time (except for
the String[] and Hashtable constructors). This was my original
preference, but I got caught up in the arguments by Kevin and lost my
ideals temporarily :)

I expect to do this later tonight or tomorrow.

Erik

On Mar 11, 2004, at 12:04 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>> Yes, I saw it. But is there a reason not to just expose HashSet
>> given that it is the data structure that is most efficient? I bought
>> into Kevin's arguments that it made sense to just expose HashSet.
>
> Just the general principle that one shouldn't expose more of the
> implementation than one must. I can imagine faster things than a
> HashSet for this, e.g., a well-coded letter tree (trie) could be a
> bit faster, since it would only touch each character in the key once.
> But it's not a big deal, perhaps not worth fixing at this point.
>
> I proposed a solution that both respected this concern (yours, as I
> recall) while at the same time avoiding copying. It doesn't need to
> be an either/or situation. We can easily hide the implementation,
> avoid copying, and use the most efficient implementation internally.
> If you no longer care about hiding the implementation, then I guess
> this is moot.
>
> Before we started this exercise the implementation was exposed, so
> things have gotten no worse, only better. But they could have gotten
> just a little bit better yet!
>
> Cheers,
>
> Doug
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Scott Ganyo wrote:
> I don't buy it. HashSet is but one implementation of a Set. By
> choosing the HashSet implementation you are not only tying the class
> to a hash-based implementation, you are tying the interface to *that
> specific* hash-based implementation or its subclasses. In the end,
> either you buy the concept of the interface and its abstraction or
> you don't. I firmly believe in using interfaces as they were intended
> to be used.

An interface isn't just the concept of a Java interface but ALSO the
implied and required semantics. TreeSet, etc. are too slow to be used
with the StopFilter, thus we should prevent their use. We require
HashSet/Map...

> Scott
>
> P.S. In fact, HashSet isn't always going to be the most efficient
> anyway. Just for one example: consider possible implementations if I
> have only 1 or 2 entries.

HashSet is not always the most efficient... if you need to do runtime
inserts and bulk removal, TreeSet/Map might be more efficient. Also, if
you need to sort the map then you're stuck with a tree.

Kevin

--
Please reply using PGP.
    http://peerfear.org/pubkey.asc
    NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
> I will refactor again using Set with no copying this time (except for
> the String[] and Hashtable constructors). This was my original
> preference, but I got caught up in the arguments by Kevin and lost my
> ideals temporarily :)
>
> I expect to do this later tonight or tomorrow.

How about this as a compromise...

No copy on constructor... use a Set but in the documentation summarize
this conversation and point out that the user should use a HashSet and
NOT any other type of set and that it will result in a copy..

I think Doug's comment about a potentially faster impl in the future
was a good point...

Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Part of the dilemma of which implementation is actually used will be
solved implicitly, since our function to construct the Set will return
a HashSet - and this will surely be the method most folks would use.
But I will be sure to note in the Javadoc that the implementation of
the Set is important.

Erik

On Mar 11, 2004, at 5:22 PM, Kevin A. Burton wrote:
> Erik Hatcher wrote:
>> I will refactor again using Set with no copying this time (except
>> for the String[] and Hashtable constructors). This was my original
>> preference, but I got caught up in the arguments by Kevin and lost
>> my ideals temporarily :)
>>
>> I expect to do this later tonight or tomorrow.
>
> How about this as a compromise...
>
> No copy on constructor... use a Set but in the documentation
> summarize this conversation and point out that the user should use a
> HashSet and NOT any other type of set and that it will result in a
> copy..
>
> I think Doug's comment about a potentially faster impl in the future
> was a good point...
>
> Kevin
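[Editor's sketch] Erik's point, that the factory decides the implementation while the signature only promises a Set, can be shown in a few lines. This assumes a helper shaped like the makeStopSet mentioned earlier in the thread; the holder class StopSets is hypothetical and the String[] parameter is an assumption modelled on makeStopTable(String[]). It is not the committed Lucene code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; only the idea of a Set-returning factory comes from the thread.
final class StopSets {
    private StopSets() {}

    // The return type is the Set interface; choosing HashSet stays an implementation detail.
    static Set<String> makeStopSet(String[] stopWords) {
        return new HashSet<String>(Arrays.asList(stopWords));
    }

    public static void main(String[] args) {
        Set<String> stops = makeStopSet(new String[] {"a", "an", "the"});
        System.out.println(stops.contains("the"));    // true
        System.out.println(stops instanceof HashSet); // true
    }
}
```

Callers who go through the factory get a hash-based set without ever naming HashSet, which leaves room to swap in a different implementation later.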
Re: DocumentWriter, StopFilter should use HashMap... (patch)
On Mar 9, 2004, at 10:23 PM, Kevin A. Burton wrote:
> You need to make it a HashSet:
>
>   table = new HashSet( stopTable.keySet() );

Done.

> Also... while you're at it... the private variable name is 'table'
> which this HashSet certainly is *not* ;)

Well, depends on your definition of 'table' I suppose :) I changed it
to a type-agnostic stopWords.

> Probably makes sense to just call this variable 'hashset' and then
> force the type to be HashSet since it's necessary for this to be a
> HashSet to maintain any decent performance. You'll need to update
> your second constructor to require a HashSet too.. would be very bad
> to let callers use another set impl... TreeSet and SortedSet would
> still be too slow...

I refuse to expose HashSet... sorry! :) But I did wrap what is passed
in, like above, in a HashSet in my latest commit.

Erik
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
>> Also... while you're at it... the private variable name is 'table'
>> which this HashSet certainly is *not* ;)
>
> Well, depends on your definition of 'table' I suppose :) I changed it
> to a type-agnostic stopWords.

Did you know that internally HashSet uses a HashMap? I sure didn't!

hashset.contains() maps to hashmap.containsKey()

It uses a key -> value mapping to a generic PRESENT Object... hm.

>> Probably makes sense to just call this variable 'hashset' and then
>> force the type to be HashSet since it's necessary for this to be a
>> HashSet to maintain any decent performance. You'll need to update
>> your second constructor to require a HashSet too.. would be very bad
>> to let callers use another set impl... TreeSet and SortedSet would
>> still be too slow...
>
> I refuse to expose HashSet... sorry! :) But I did wrap what is
> passed in, like above, in a HashSet in my latest commit.

Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Why do you have a problem exposing a HashSet/Map... it SHOULD be a
hash-based implementation. Doing anything else is just wrong and would
seriously slow down Lucene indexing.

Also... your HashSet constructor has to copy values from the original
HashSet into the new HashSet ... not very clean and this can just be
removed by forcing the caller to use a HashSet (which they should).

:)

Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
On Mar 10, 2004, at 2:59 PM, Kevin A. Burton wrote:
>> I refuse to expose HashSet... sorry! :) But I did wrap what is
>> passed in, like above, in a HashSet in my latest commit.
>
> Hm... You're doing this EVEN if the caller passes a HashSet
> directly?!

Well, it was in the ctor. But I guess I'm not seeing the filter being
constructed often enough for this to cause a performance hit.

> Why do you have a problem exposing a HashSet/Map... it SHOULD be a
> hash-based implementation. Doing anything else is just wrong and
> would seriously slow down Lucene indexing.

Just semantically, it is a set of stop words - so in theory the actual
implementation shouldn't matter. I'm an interface purist at heart.

> Also... your HashSet constructor has to copy values from the original
> HashSet into the new HashSet ... not very clean and this can just be
> removed by forcing the caller to use a HashSet (which they should).

I've caved in and gone HashSet all the way.

Erik
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
>> Also... your HashSet constructor has to copy values from the
>> original HashSet into the new HashSet ... not very clean and this
>> can just be removed by forcing the caller to use a HashSet (which
>> they should).
>
> I've caved in and gone HashSet all the way.

Did you not see my message suggesting a way to both not expose HashSet
publicly and also not to copy values? If not, I attached it.

Doug

---BeginMessage---
[EMAIL PROTECTED] wrote:
> -  public StopFilter(TokenStream in, Set stopTable) {
> +  public StopFilter(TokenStream in, Set stopWords) {
>      super(in);
> -    table = stopTable;
> +    this.stopWords = new HashSet(stopWords);
>    }

This always allocates a new HashSet, which, if the stop list is large,
and documents are small, could impact performance.

Perhaps we can replace this with something like:

  public StopFilter(TokenStream in, Set stopWords) {
    this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
         : new HashSet(stopWords));
  }

and then add another constructor:

  private StopFilter(TokenStream in, HashSet stopWords) {
    super(in);
    this.stopWords = stopWords;
  }

Also, if we want the implementation to always be a HashSet internally,
for performance, we ought to declare the field to be a HashSet, no?

The competing goals here are:
1. Not to expose publicly the implementation of the Set;
2. Not to copy the contents of the Set when folks pass the value of
   makeStopSet.
3. Use the most efficient implementation internally.

I think the changes above meet all of these.

Doug
---End Message---
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote:
> Erik Hatcher wrote:
>>> Also... your HashSet constructor has to copy values from the
>>> original HashSet into the new HashSet ... not very clean and this
>>> can just be removed by forcing the caller to use a HashSet (which
>>> they should).
>>
>> I've caved in and gone HashSet all the way.
>
> Did you not see my message suggesting a way to both not expose
> HashSet publicly and also not to copy values? If not, I attached it.

For the record I didn't see it... but it echoes my points...

Thanks!

Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>>> Also... your HashSet constructor has to copy values from the
>>> original HashSet into the new HashSet ... not very clean and this
>>> can just be removed by forcing the caller to use a HashSet (which
>>> they should).
>>
>> I've caved in and gone HashSet all the way.
>
> Did you not see my message suggesting a way to both not expose
> HashSet publicly and also not to copy values? If not, I attached it.

Yes, I saw it. But is there a reason not to just expose HashSet given
that it is the data structure that is most efficient? I bought into
Kevin's arguments that it made sense to just expose HashSet.

As for copying values - that is only happening now if you use the
Hashtable or String[] constructor.

Erik

> Doug
>
> From: Doug Cutting [EMAIL PROTECTED]
> Date: March 10, 2004 1:08:24 PM EST
> To: Lucene Developers List [EMAIL PROTECTED]
> Subject: Re: cvs commit:
>   jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
> Reply-To: Lucene Developers List [EMAIL PROTECTED]
>
> [EMAIL PROTECTED] wrote:
>> -  public StopFilter(TokenStream in, Set stopTable) {
>> +  public StopFilter(TokenStream in, Set stopWords) {
>>      super(in);
>> -    table = stopTable;
>> +    this.stopWords = new HashSet(stopWords);
>>    }
>
> This always allocates a new HashSet, which, if the stop list is
> large, and documents are small, could impact performance.
>
> Perhaps we can replace this with something like:
>
>   public StopFilter(TokenStream in, Set stopWords) {
>     this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>          : new HashSet(stopWords));
>   }
>
> and then add another constructor:
>
>   private StopFilter(TokenStream in, HashSet stopWords) {
>     super(in);
>     this.stopWords = stopWords;
>   }
>
> Also, if we want the implementation to always be a HashSet
> internally, for performance, we ought to declare the field to be a
> HashSet, no?
>
> The competing goals here are:
> 1. Not to expose publicly the implementation of the Set;
> 2. Not to copy the contents of the Set when folks pass the value of
>    makeStopSet.
> 3. Use the most efficient implementation internally.
>
> I think the changes above meet all of these.
>
> Doug
Re: DocumentWriter, StopFilter should use HashMap... (patch)
I really don't think this will make any noticeable difference, but why
not. Could you please send a diff -uN patch? I made the same changes
locally about a year ago, but have since thrown away my local changes
(for no good reason that I recall).

Thanks,
Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
> I'm looking at StopFilter.java right now...
>
> I did a kill -3 java and a number of my threads were blocked here:
>
> ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
>     at java.util.Hashtable.get(Hashtable.java:332)
>     - waiting to lock 0x61569720 (a java.util.Hashtable)
>     at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
>     at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
>     at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>     at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
>     at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)
>
> Is there ANY reason to keep this as a Hashtable? It's just preventing
> inversion across multiple threads. They all have to lock on this
> Hashtable.
>
> Note that this guy is initialized ONCE and no more puts take place,
> so I don't see why not. It's read-only after the StopFilter is
> created.
>
> I think this might really end up speeding up indexing a bit. No hard
> benchmarks yet though. Right now though it's just an inefficiency
> that should be removed.
>
> I've attached a quick implementation.
>
> Kevin
>
> [Attachment: StopFilter.java, truncated in the archive]
>
>   package org.apache.lucene.analysis;
>
>   /* The Apache Software License, Version 1.1
>    * Copyright (c) 2001 The Apache Software Foundation.
>    * All rights reserved. */
>
>   import java.io.IOException;
>   import java.util.*;
>
>   /** Removes stop
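[Editor's sketch] The contention Kevin describes comes from Hashtable.get being synchronized, so every indexing thread queues on one monitor. A set that is built once and never modified afterwards can be read by many threads with no lock at all. The following is a minimal illustration of that idea, not Lucene code: the class SharedStopSetDemo, its methods, and the token lists are all hypothetical, and it assumes the set is safely published (here via a static final field) before any reader starts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo; stands in for several threads running a stop filter at once.
final class SharedStopSetDemo {
    // Built once, never modified: unsynchronized concurrent reads are safe.
    static final Set<String> STOP_WORDS = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("a", "an", "the")));

    // Counts the tokens that would survive stop-word filtering.
    static int countNonStopWords(List<String> tokens) {
        int kept = 0;
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) { // lock-free lookup
                kept++;
            }
        }
        return kept;
    }

    // Filters each document on a worker thread and sums the results.
    static int indexInParallel(List<List<String>> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<Future<Integer>>();
            for (final List<String> doc : docs) {
                Callable<Integer> task = () -> countNonStopWords(doc);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "quick", "fox"),
                Arrays.asList("a", "lazy", "dog", "the"));
        System.out.println(indexInParallel(docs, 2)); // 4
    }
}
```

This only holds while nobody writes to the set after construction, which matches Kevin's observation that the stop table is initialized once and read-only afterwards.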
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Otis Gospodnetic wrote:
> I really don't think this will make any noticeable difference, but
> why not. Could you please send a diff -uN patch? I made the same
> changes locally about a year ago, but have since thrown away my local
> changes (for no good reason that I recall).

Just diff it locally... it's just a search and replace of Hashtable ->
HashMap...

Pretty trivial.

Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
> I don't see any reason for this to be a Hashtable.
>
> It seems an acceptable alternative to not share analyzer/filter
> instances across threads - they don't really take up much space, so
> is there a reason to share them? Or I'm guessing you're sharing it
> implicitly through an IndexWriter, huh?
>
> I'll await further feedback before committing this change, but it
> seems reasonable to me.

Yeah... I'm using a RAMDirectory and adding documents to it across
multiple threads... some of them index at the same time.

The patch is super small... the only difference is that it's using a
HashMap, which isn't synchronized... it can't hurt anything... but
feedback is a good thing :)

Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Well, one issue you didn't consider is changing a public method
signature. I will make this change, but leave the Hashtable signature
method there. I suppose we could change the signature to use a Map
instead, but I believe there are some issues with doing something like
this if you do not recompile your own source code against a new Lucene
JAR, so I will simply provide another signature too.

Erik

On Mar 9, 2004, at 4:15 AM, Kevin A. Burton wrote:
> Erik Hatcher wrote:
>> I don't see any reason for this to be a Hashtable.
>>
>> It seems an acceptable alternative to not share analyzer/filter
>> instances across threads - they don't really take up much space, so
>> is there a reason to share them? Or I'm guessing you're sharing it
>> implicitly through an IndexWriter, huh?
>>
>> I'll await further feedback before committing this change, but it
>> seems reasonable to me.
>
> Yeah... I'm using a RAMDirectory and adding documents to it across
> multiple threads... some of them index at the same time.
>
> The patch is super small... the only difference is that it's using a
> HashMap, which isn't synchronized... it can't hurt anything... but
> feedback is a good thing :)
>
> Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
> Well, one issue you didn't consider is changing a public method
> signature. I will make this change, but leave the Hashtable signature
> method there. I suppose we could change the signature to use a Map
> instead, but I believe there are some issues with doing something
> like this if you do not recompile your own source code against a new
> Lucene JAR, so I will simply provide another signature too.

This is also a problem for folks who're implementing analyzers which
use StopFilter. For example:

  public class MyAnalyzer extends Analyzer {
    private static Hashtable stopTable =
      StopFilter.makeStopTable(stopWords);

    public TokenStream tokenStream(String field, Reader reader) {
      ... new StopFilter(stopTable) ...
    }
  }

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:
1. Keep but deprecate the StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);

Does that make sense?

Doug
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Maybe I missed something, but I always thought the stop list should be
a Set, not a Map (or Hashtable/Dictionary). After all, all you need to
know is existence, and that's what a Set does.

Doug Cutting wrote:
> Erik Hatcher wrote:
>> Well, one issue you didn't consider is changing a public method
>> signature. I will make this change, but leave the Hashtable
>> signature method there. I suppose we could change the signature to
>> use a Map instead, but I believe there are some issues with doing
>> something like this if you do not recompile your own source code
>> against a new Lucene JAR, so I will simply provide another signature
>> too.
>
> This is also a problem for folks who're implementing analyzers which
> use StopFilter. For example:
>
>   public class MyAnalyzer extends Analyzer {
>     private static Hashtable stopTable =
>       StopFilter.makeStopTable(stopWords);
>
>     public TokenStream tokenStream(String field, Reader reader) {
>       ... new StopFilter(stopTable) ...
>     }
>   }
>
> This would no longer compile with the change Kevin proposes.
>
> To make things back-compatible we must:
> 1. Keep but deprecate the StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);
> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Does that make sense?
>
> Doug
Re: DocumentWriter, StopFilter should use HashMap... (patch)
David Spencer wrote:
> Maybe I missed something but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence and that's what a Set does.

Good point.

Doug
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote:
> This would no longer compile with the change Kevin proposes. To make things back-compatible we must:
>
> 1. Keep but deprecate the StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);
> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Does that make sense?

Ah... ok... good point. If no one does this I'll take care of it...

Kevin

--
Please reply using PGP: http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
Re: DocumentWriter, StopFilter should use HashMap... (patch)
David Spencer wrote:
> Maybe I missed something but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence and that's what a Set does.

It stores the word as both the key and the value... I don't care either way... There was no HashSet back when this was written. I was just going to leave it as a HashMap so that in the future, if we ever wanted to change the value, we could... Either way.

--
Kevin A. Burton
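To make David's point concrete, here is a small sketch of a stop list held as a Set, where the only question asked is whether a token is present. The class and method names (StopSetSketch, makeStopSet, removeStopWords) are hypothetical and not part of Lucene; the real filter works on a TokenStream rather than a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of a Set-based stop list; not Lucene code. */
final class StopSetSketch {

    /** Builds the stop set once from an array of words. */
    static Set<String> makeStopSet(String[] stopWords) {
        return new HashSet<>(Arrays.asList(stopWords));
    }

    /** Returns the tokens that are not stop words, in their original order. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Membership is all that is needed; no value is stored per word.
            if (!stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A Map keyed and valued by the same word, as Kevin describes, answers the same question through containsKey; the Set simply states the intent more directly.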
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote:
> Erik Hatcher wrote:
>> Well, one issue you didn't consider is changing a public method signature. I will make this change, but leave the Hashtable signature method there. I suppose we could change the signature to use a Map instead, but I believe there are some issues with doing something like this if you do not recompile your own source code against a new Lucene JAR so I will simply provide another signature too.
>
> This would no longer compile with the change Kevin proposes. To make things back-compatible we must:
>
> 1. Keep but deprecate the StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);
> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Does that make sense?

This patch and attachment take care of this problem...

It does make this class more complex than it needs to be... but 1/2 of the methods are deprecated.

Kevin

package org.apache.lucene.analysis;

/*
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation. All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       This product includes software developed by the
 *       Apache Software Foundation (http://www.apache.org/).
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names Apache and Apache Software Foundation and
 *    Apache Lucene must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called Apache,
 *    Apache Lucene, nor may Apache appear in their name, without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation. For more
 * information on the Apache Software Foundation, please see
 * http://www.apache.org/.
 */

import java.io.IOException;
import java.util.*;

/** Removes stop words from a token stream. */
public final class StopFilter extends TokenFilter {

  //Note: this could migrate to using a HashSet
  private HashMap map;

  /** Constructs a filter which removes words from the input
      TokenStream that are named in the array of words. */
  public StopFilter(TokenStream in, String[] stopWords) {
    super(in);
    map = makeStopMap(stopWords);
  }

  /** Constructs a filter which removes words from the input
      TokenStream that are named in the HashMap. */
  public StopFilter(TokenStream in, HashMap stopMap) {
    super(in);
    map = stopMap;
  }

  /**
   * @deprecated Use HashMap instead.
   */
  public StopFilter(TokenStream in, Hashtable stopTable) {
    super(in);
    map = new HashMap();
    Enumeration keys = stopTable.keys();
    while ( keys.hasMoreElements() ) {
      Object key = keys.nextElement();
      map.put( key, stopTable.get( key ) );
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote:
> David Spencer wrote:
>> Maybe I missed something but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence and that's what a Set does.
>
> Good point.

It's easy to migrate to a HashSet... either way... I was thinking about the same thing myself...

Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Kevin - I've made this change and committed it, using a Set. Let me know if there are any issues with what I've committed - I believe I've faithfully preserved backwards compatibility.

Erik

p.s. ...

On Mar 9, 2004, at 2:00 PM, Kevin A. Burton wrote:
>   public StopFilter(TokenStream in, Hashtable stopTable) {
>     super(in);
>     map = new HashMap();
>     Enumeration keys = stopTable.keys();
>     while ( keys.hasMoreElements() ) {
>       Object key = keys.nextElement();
>       map.put( key, stopTable.get( key ) );
>     }
>   }

By the way, the ctor to HashMap can take a Map, which Hashtable is also :))
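Erik's p.s. can be sketched as follows: since java.util.Hashtable implements java.util.Map, the HashMap copy constructor does the same work as the hand-written Enumeration loop. The class and method names here (CopyCtorSketch, copyOf) are hypothetical, and generics are used only to keep the example warning-free on current JDKs.

```java
import java.util.HashMap;
import java.util.Hashtable;

/** Hypothetical illustration of the HashMap copy constructor; not Lucene code. */
final class CopyCtorSketch {

    /** Copies every entry of the Hashtable into a new, unsynchronized HashMap. */
    static HashMap<String, String> copyOf(Hashtable<String, String> stopTable) {
        // HashMap(Map) accepts any Map, and Hashtable is one.
        return new HashMap<>(stopTable);
    }
}
```

The copy is independent of the original table, so later changes to the Hashtable are not reflected in it.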
Re: DocumentWriter, StopFilter should use HashMap... (patch)
> This would no longer compile with the change Kevin proposes. To make things back-compatible we must:
>
> 1. Keep but deprecate the StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);

If you used StopFilter(Map), it would be backward compatible for users passing a Hashtable to the constructor. I'm not sure about older Java versions, but in Java 1.4 Hashtable implements Map. (And on the other hand, why HashMap and not Map?)

> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Does that make sense?
>
> Doug

incze
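A small sketch of incze's suggestion, using a hypothetical class (MapParamSketch is not a Lucene type): a single constructor declared against the Map interface accepts both a Hashtable and a HashMap at the source level.

```java
import java.util.Map;

/** Hypothetical illustration of a Map-typed constructor; not Lucene code. */
final class MapParamSketch {
    private final Map<String, ?> stopMap;

    /** Accepts any Map implementation, including Hashtable and HashMap. */
    MapParamSketch(Map<String, ?> stopMap) {
        this.stopMap = stopMap;
    }

    /** Illustrative lookup by key only; the values are never read. */
    boolean isStopWord(String term) {
        return stopMap.containsKey(term);
    }
}
```

One caveat, which is presumably what Erik alluded to earlier in the thread: replacing a Hashtable parameter with a Map parameter is source compatible but not binary compatible. Callers compiled against the old signature reference the old method descriptor and would fail to link against the new JAR until they are recompiled, which is an argument for adding an overload rather than changing the existing one.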
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
> Kevin - I've made this change and committed it, using a Set. Let me know if there are any issues with what I've committed - I believe I've faithfully preserved backwards compatibility.

Great... I'll take a look!

> By the way, the ctor to HashMap can take a Map, which Hashtable is also :))

Crap... good point.. Actually that was the FIRST thing I checked, but my javadoc index wasn't up to date... long story. Actually I was pissed to find out that it didn't implement a map interface... :)

--
Kevin A. Burton
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote:
> Kevin - I've made this change and committed it, using a Set. Let me know if there are any issues with what I've committed - I believe I've faithfully preserved backwards compatibility.

Actually... Erik.. I don't think your Hashtable constructor will work... By default Hashtable.keySet returns a SynchronizedSet (on JDK 1.4.2), so we're back to where we started:

    public StopFilter(TokenStream in, Hashtable stopTable) {
      super(in);
      table = stopTable.keySet();
    }

You need to make it a HashSet:

    table = new HashSet( stopTable.keySet() );

Also... while you're at it... the private variable name is 'table', which this HashSet certainly is *not* ;)

Probably makes sense to just call this variable 'hashset' and then force the type to be HashSet, since it's necessary for this to be a HashSet to maintain any decent performance. You'll need to update your second constructor to require a HashSet too.. it would be very bad to let callers use another set impl... TreeSet and SortedSet would still be too slow...

Anyway... I had this feature in my patch ;)

Thanks!

Kevin
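Kevin's fix can be sketched in isolation. Hashtable.keySet() returns a view backed by the table (a synchronized wrapper in the JDKs I am aware of, though the exact class is an implementation detail), so lookups through it still go through the table. Copying the keys into a fresh HashSet yields a detached, unsynchronized set. The names KeySetCopySketch and stopSetFrom are hypothetical.

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

/** Hypothetical illustration of copying Hashtable keys; not Lucene code. */
final class KeySetCopySketch {

    /** Copies the keys out of the table so lookups no longer touch the Hashtable. */
    static Set<String> stopSetFrom(Hashtable<String, ?> stopTable) {
        // The copy constructor iterates the key view once and stores the keys
        // in a plain HashSet that is independent of the original table.
        return new HashSet<>(stopTable.keySet());
    }
}
```

The cost is one copy per constructed filter, which is the allocation Doug raises elsewhere in the thread for large stop lists and small documents.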
DocumentWriter, StopFilter should use HashMap... (patch)
I'm looking at StopFilter.java right now... I did a kill -3 java and a number of my threads were blocked here:

ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
    at java.util.Hashtable.get(Hashtable.java:332)
    - waiting to lock 0x61569720 (a java.util.Hashtable)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
    at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
    at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
    at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable? It's just preventing inversion across multiple threads. They all have to lock on this hashtable. Note that this guy is initialized ONCE and no more puts take place, so I don't see why not. It's read-only after the StopFilter is created.

I think this might really end up speeding up indexing a bit. No hard benchmarks yet though. Right now though it's just an inefficiency that should be removed.

I've attached a quick implementation.

Kevin

package org.apache.lucene.analysis;

import java.io.IOException;
import java.util.*;

/** Removes stop words from a token stream. */
public final class StopFilter extends TokenFilter {

  //Note: this could migrate to using a HashSet
  private HashMap table;

  /** Constructs a filter which removes words from the input
      TokenStream that are named in the array of words. */
  public StopFilter(TokenStream in, String[] stopWords) {
    super(in);
    table = makeStopTable(stopWords);
  }

  /** Constructs a filter which removes words from the input
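The situation Kevin describes, a lookup structure that is filled once and only read afterwards, can be sketched without Lucene. In this hypothetical example (SharedStopSetSketch, countKept, and countKeptInParallel are illustrative names, and java.util.concurrent did not exist in the JDK of the time), several threads read one shared, never-modified HashSet without taking any lock. That is safe only because nothing writes to the set after it has been published through the static final field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of lock-free reads on a read-only set. */
final class SharedStopSetSketch {

    // Built once during class initialization and never modified afterwards.
    private static final Set<String> STOP_WORDS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("a", "an", "the")));

    /** Counts the tokens that are not stop words. */
    static int countKept(List<String> tokens) {
        int kept = 0;
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept++;
            }
        }
        return kept;
    }

    /** Runs countKept on several threads at once and sums the results. */
    static int countKeptInParallel(final List<String> tokens, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Integer> task = () -> countKept(tokens);
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With a Hashtable in place of the HashSet, every contains or get call would synchronize on the table, which is the monitor the blocked threads in the dump are waiting for.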
Re: DocumentWriter, StopFilter should use HashMap... (patch)
I don't see any reason for this to be a Hashtable. An acceptable alternative seems to be not sharing analyzer/filter instances across threads - they don't really take up much space, so is there a reason to share them? Or I'm guessing you're sharing it implicitly through an IndexWriter, huh?

I'll await further feedback before committing this change, but it seems reasonable to me.

Erik

On Mar 8, 2004, at 8:50 PM, Kevin A. Burton wrote:
> I'm looking at StopFilter.java right now... I did a kill -3 java and a number of my threads were blocked here:
>
> ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
>     at java.util.Hashtable.get(Hashtable.java:332)
>     - waiting to lock 0x61569720 (a java.util.Hashtable)
>     at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
>     at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
>     at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>     at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
>     at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)
>
> Is there ANY reason to keep this as a Hashtable? It's just preventing inversion across multiple threads. They all have to lock on this hashtable. Note that this guy is initialized ONCE and no more puts take place so I don't see why not. It's readonly after the StopFilter is created.
>
> I think this might really end up speeding up indexing a bit. No hard benchmarks yet though. Right now though it's just an inefficiency that should be removed.
>
> I've attached a quick implementation.
>
> Kevin