Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-14 Thread Stephane James Vaucher
Just found the rest of the thread. I'll shut up now ;)

sv

On Sun, 14 Mar 2004, Stephane James Vaucher wrote:

 Back from a week's vacation, so this reply is a little late, maybe out of
 order as well ;). Comment inline:

 On Tue, 9 Mar 2004, Kevin A. Burton wrote:

  Doug Cutting wrote:
 
   Erik Hatcher wrote:
  
   Well, one issue you didn't consider is changing a public method
   signature.  I will make this change, but leave the Hashtable
   signature method there.  I suppose we could change the signature to
   use a Map instead, but I believe there are some issues with doing
   something like this if you do not recompile your own source code
   against a new Lucene JAR so I will simply provide another
   signature too.
  
  
   This would no longer compile with the change Kevin proposes.
  
   To make things back-compatible we must:
  
   1. Keep but deprecate StopFilter(Hashtable) constructor;
   2. Keep but deprecate StopFilter.makeStopTable(String[]);
   3. Add a new constructor: StopFilter(HashMap);
   4. Add a new method: StopFilter.makeStopMap(String[]);

 Why impose implementation details in the constructor? Shouldn't the
 constructor use a Map (not a HashMap), a Set, or a String array?

 sv

  
   Does that make sense?
  
  This patch and attachment take care of this problem...
 
  It does make this class more complex than it needs to be... but 1/2 of
  the methods are deprecated.
 
  Kevin
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]






Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Scott ganyo
I don't buy it.  HashSet is but one implementation of a Set.  By 
choosing the HashSet implementation you are not only tying the class to 
a hash-based implementation, you are tying the interface to *that 
specific* hash-based implementation or its subclasses.  In the end, 
either you buy the concept of the interface and its abstraction or you 
don't.  I firmly believe in using interfaces as they were intended to 
be used.

Scott

P.S. In fact, HashSet isn't always going to be the most efficient 
anyway.  Just for one example:  Consider possible implementations if I 
have only 1 or 2 entries.
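
To make the P.S. concrete, here is a minimal sketch of choosing a Set implementation by size. SmallStopSets and its of() method are hypothetical names, not part of Lucene, and whether the non-hash variants are measurably faster is not something this thread establishes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: picks a Set implementation by size.
final class SmallStopSets {
    private SmallStopSets() {}

    static Set<String> of(String... words) {
        switch (words.length) {
            case 0:
                return Collections.emptySet();            // nothing to look up
            case 1:
                return Collections.singleton(words[0]);   // single comparison, no hash table
            default:
                return new HashSet<>(Arrays.asList(words));
        }
    }

    public static void main(String[] args) {
        Set<String> one = of("the");
        System.out.println(one.contains("the"));      // true
        System.out.println(one instanceof HashSet);   // false
    }
}
```

A caller that only depends on Set cannot tell which branch was taken, which is the point of keeping the interface in the signature.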

On Mar 10, 2004, at 11:13 PM, Erik Hatcher wrote:

On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
Erik Hatcher wrote:
Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean and this 
can just be removed by forcing the caller to use a HashSet (which 
they should).
I've caved in and gone HashSet all the way.
Did you not see my message suggesting a way to both not expose 
HashSet publicly and also not to copy values?  If not, I attached it.
Yes, I saw it.  But is there a reason not to just expose HashSet given 
that it is the data structure that is most efficient?  I bought into 
Kevin's arguments that it made sense to just expose HashSet.

As for copying values - that is only happening now if you use the 
Hashtable or String[] constructor.

	Erik


Doug



From: Doug Cutting [EMAIL PROTECTED]
Date: March 10, 2004 1:08:24 PM EST
To: Lucene Developers List [EMAIL PROTECTED]
Subject: Re: cvs commit: 
jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Reply-To: Lucene Developers List [EMAIL PROTECTED]

[EMAIL PROTECTED] wrote:
  -  public StopFilter(TokenStream in, Set stopTable) {
  +  public StopFilter(TokenStream in, Set stopWords) {
   super(in);
  -table = stopTable;
  +this.stopWords = new HashSet(stopWords);
 }
This always allocates a new HashSet, which, if the stop list is 
large, and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
  this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
   : new HashSet(stopWords));
}
and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
  super(in);
  this.stopWords = stopWords;
}
Also, if we want the implementation to always be a HashSet 
internally, for performance, we ought to declare the field to be a 
HashSet, no?

The competing goals here are:
  1. Not to expose publicly the implementation of the Set;
  2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
  3. Use the most efficient implementation internally.

I think the changes above meet all of these.

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Doug Cutting
Erik Hatcher wrote:
Yes, I saw it.  But is there a reason not to just expose HashSet given 
that it is the data structure that is most efficient?  I bought into 
Kevin's arguments that it made sense to just expose HashSet.
Just the general principle that one shouldn't expose more of the 
implementation than one must.  I can imagine faster things than a 
HashSet for this, e.g., a well-coded letter tree (trie) could be a bit 
faster, since it would only touch each character in the key once.  But 
it's not a big deal, perhaps not worth fixing at this point.
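
For readers unfamiliar with the letter-tree idea, a minimal sketch follows. StopWordTrie is a hypothetical class, not Lucene code, and it uses HashMap children for brevity; a well-coded version in the sense meant above would more likely use arrays, so treat this as an illustration of the lookup walk only, not of the performance claim.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a letter tree (trie) for stop words.
final class StopWordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;   // true if a stop word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // Walks the key once, character by character, and stops at the first miss.
    boolean contains(CharSequence word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        StopWordTrie trie = new StopWordTrie();
        trie.add("the");
        System.out.println(trie.contains("the"));  // true
        System.out.println(trie.contains("th"));   // false
    }
}
```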

I proposed a solution that both respected this concern (yours, as I 
recall) while at the same time avoiding copying.  It doesn't need to be 
an either/or situation.  We can easily hide the implementation, avoid 
copying, and use the most efficient implementation internally.  If you 
no longer care about hiding the implementation, then I guess this is 
moot.  Before we started this exercise the implementation was exposed, 
so things have gotten no worse, only better.  But they could have gotten 
just a little bit better yet!

Cheers,

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Erik Hatcher
I will refactor again using Set with no copying this time (except for 
the String[] and Hashtable) constructors.  This was my original 
preference, but I got caught up in the arguments by Kevin and lost my 
ideals temporarily :)

I expect to do this later tonight or tomorrow.

	Erik

On Mar 11, 2004, at 12:04 PM, Doug Cutting wrote:

Erik Hatcher wrote:
Yes, I saw it.  But is there a reason not to just expose HashSet 
given that it is the data structure that is most efficient?  I bought 
into Kevin's arguments that it made sense to just expose HashSet.
Just the general principle that one shouldn't expose more of the 
implementation than one must.  I can imagine faster things than a 
HashSet for this, e.g., a well-coded letter tree (trie) could be a bit 
faster, since it would only touch each character in the key once.  But 
it's not a big deal, perhaps not worth fixing at this point.

I proposed a solution that both respected this concern (yours, as I 
recall) while at the same time avoiding copying.  It doesn't need to 
be an either/or situation.  We can easily hide the implementation, 
avoid copying, and use the most efficient implementation internally.  
If you no longer care about hiding the implementation, then I guess 
this is moot.  Before we started this exercise the implementation was 
exposed, so things have gotten no worse, only better.  But they could 
have gotten just a little bit better yet!

Cheers,

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Kevin A. Burton
Scott ganyo wrote:

I don't buy it.  HashSet is but one implementation of a Set.  By 
choosing the HashSet implementation you are not only tying the class 
to a hash-based implementation, you are tying the interface to *that 
specific* hash-based implementation or its subclasses.  In the end, 
either you buy the concept of the interface and its abstraction or you 
don't.  I firmly believe in using interfaces as they were intended to 
be used.
An interface isn't just the concept of a Java interface but ALSO the 
implied and required semantics.

TreeSet, etc. are too slow to be used with the StopFilter, thus we should 
prevent their use. 

We require HashSet/Map...

Scott

P.S. In fact, HashSet isn't always going to be the most efficient 
anyway.  Just for one example:  Consider possible implementations if I 
have only 1 or 2 entries.

HashSet is not always the most efficient... if you need to do runtime 
inserts and bulk removal TreeSet/Map might be more efficient.  Also if 
you need to sort the map then you're stuck with a tree.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Kevin A. Burton
Erik Hatcher wrote:

I will refactor again using Set with no copying this time (except for 
the String[] and Hashtable) constructors.  This was my original 
preference, but I got caught up in the arguments by Kevin and lost my 
ideals temporarily :)

I expect to do this later tonight or tomorrow.
How about this as a compromise...

No copy on constructor... use a Set but in the documentation summarize 
this conversation and point out that the user should use a HashSet and 
NOT any other type of set and that it will result in a copy..

I think Doug's comment about a potentially faster impl in the future was 
a good point...

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Erik Hatcher
Part of the dilemma of which implementation actually gets used will be 
solved implicitly, since our function to construct the Set will return a 
HashSet - and this will surely be the method most folks would use.  But 
I will be sure to note in the Javadoc that the implementation of the 
Set is important.

	Erik

On Mar 11, 2004, at 5:22 PM, Kevin A. Burton wrote:

Erik Hatcher wrote:

I will refactor again using Set with no copying this time (except for 
the String[] and Hashtable) constructors.  This was my original 
preference, but I got caught up in the arguments by Kevin and lost my 
ideals temporarily :)

I expect to do this later tonight or tomorrow.
How about this as a compromise...

No copy on constructor... use a Set but in the documentation summarize 
this conversation and point out that the user should use a HashSet and 
NOT any other type of set and that it will result in a copy..

I think Doug's comment about a potentially faster impl in the future 
was a good point...

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Erik Hatcher
On Mar 9, 2004, at 10:23 PM, Kevin A. Burton wrote:
You need to make it a HashSet:

  table = new HashSet( stopTable.keySet() );
Done.

Also... while you're at it... the private variable name is 'table' 
which this HashSet certainly is *not* ;)
Well, depends on your definition of 'table' I suppose :)  I changed it 
to a type-agnostic stopWords.

Probably makes sense to just call this variable 'hashset' and then 
force the type to be HashSet since it's necessary for this to be a 
HashSet to maintain any decent performance.  You'll need to update 
your second constructor to require a HashSet too.. would be very bad 
to let callers use another set impl... TreeSet and SortedSet would 
still be too slow...
I refuse to expose HashSet... sorry!  :)  But I did wrap what is passed 
in, like above, in a HashSet in my latest commit.

	Erik



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Kevin A. Burton
Erik Hatcher wrote:


Also... while you're at it... the private variable name is 'table' 
which this HashSet certainly is *not* ;)


Well, depends on your definition of 'table' I suppose :)  I changed it 
to a type-agnostic stopWords.
Did you know that internally HashSet uses a HashMap?

I sure didn't!

hashset.contains() maps to hashmap.containsKey()

It uses a key - value mapping to a generic PRESENT Object... hm. 
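
The idea can be sketched roughly as below. This is a simplified illustration under my own naming (MapBackedSet is hypothetical), not the actual java.util.HashSet source, which differs in detail.

```java
import java.util.HashMap;

// Simplified sketch of a set that delegates to a map, storing every
// element as a key mapped to one shared placeholder value.
final class MapBackedSet<E> {
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<>();

    // Returns true only if the element was not already in the set.
    boolean add(E element) {
        return map.put(element, PRESENT) == null;
    }

    // Membership is just a key lookup on the backing map.
    boolean contains(Object element) {
        return map.containsKey(element);
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        MapBackedSet<String> stopWords = new MapBackedSet<>();
        stopWords.add("the");
        System.out.println(stopWords.contains("the"));  // true
        System.out.println(stopWords.contains("fox"));  // false
    }
}
```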

Probably makes sense to just call this variable 'hashset' and then 
force the type to be HashSet since it's necessary for this to be a 
HashSet to maintain any decent performance.  You'll need to update 
your second constructor to require a HashSet too.. would be very bad 
to let callers use another set impl... TreeSet and SortedSet would 
still be too slow...
I refuse to expose HashSet... sorry!  :)  But I did wrap what is 
passed in, like above, in a HashSet in my latest commit. 
Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Why do you have a problem exposing a HashSet/Map... it SHOULD be a Hash 
based implementation.  Doing anything else is just wrong and would 
seriously slow down Lucene indexing.

Also... your HashSet constructor has to copy values from the original 
HashSet into the new HashSet ... not very clean and this can just be 
removed by forcing the caller to use a HashSet (which they should).

:)

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Erik Hatcher
On Mar 10, 2004, at 2:59 PM, Kevin A. Burton wrote:
I refuse to expose HashSet... sorry!  :)  But I did wrap what is 
passed in, like above, in a HashSet in my latest commit.
Hm... You're doing this EVEN if the caller passes a HashSet directly?!
Well it was in the ctor.  But I guess I'm not seeing all the times the 
filter is being constructed to make this cause a performance hit.

Why do you have a problem exposing a HashSet/Map... it SHOULD be a 
Hash based implementation.  Doing anything else is just wrong and 
would seriously slow down Lucene indexing.
Just semantically, it is a set of stop words - so in theory it 
shouldn't matter the actual implementation.  I'm an interface purist at 
heart.

Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean and this can 
just be removed by forcing the caller to use a HashSet (which they 
should).
I've caved in and gone HashSet all the way.

	Erik



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Doug Cutting
Erik Hatcher wrote:
Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean and this can 
just be removed by forcing the caller to use a HashSet (which they 
should).
I've caved in and gone HashSet all the way.
Did you not see my message suggesting a way to both not expose HashSet 
publicly and also not to copy values?  If not, I attached it.

Doug


---BeginMessage---
[EMAIL PROTECTED] wrote:
  -  public StopFilter(TokenStream in, Set stopTable) {
  +  public StopFilter(TokenStream in, Set stopWords) {
   super(in);
  -table = stopTable;
  +this.stopWords = new HashSet(stopWords);
 }
This always allocates a new HashSet, which, if the stop list is large, 
and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
  this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
   : new HashSet(stopWords));
}
and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
  super(in);
  this.stopWords = stopWords;
}
Also, if we want the implementation to always be a HashSet internally, 
for performance, we ought to declare the field to be a HashSet, no?

The competing goals here are:
  1. Not to expose publicly the implementation of the Set;
  2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
  3. Use the most efficient implementation internally.

I think the changes above meet all of these.
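
The proposal above can be tried out in isolation with a sketch like the following. StopWordHolder is a hypothetical stand-in for StopFilter with the TokenStream argument left out, generics are used although the code under discussion predates them, and sharesStorageWith exists only so the no-copy behaviour can be observed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for StopFilter (TokenStream argument omitted).
final class StopWordHolder {
    // Declared as HashSet so lookups always hit a hash-based implementation.
    private final HashSet<String> stopWords;

    // Public face: callers only ever see Set.
    StopWordHolder(Set<String> stopWords) {
        this(stopWords instanceof HashSet
                ? (HashSet<String>) stopWords          // already hash-based: reuse, no copy
                : new HashSet<String>(stopWords));     // anything else: copy once
    }

    // Private constructor keeps HashSet out of the public signature.
    private StopWordHolder(HashSet<String> stopWords) {
        this.stopWords = stopWords;
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }

    // Illustration only: reports whether the caller's set was reused as-is.
    boolean sharesStorageWith(Set<String> candidate) {
        return stopWords == candidate;
    }

    public static void main(String[] args) {
        Set<String> hash = new HashSet<>(Arrays.asList("a", "the"));
        Set<String> tree = new TreeSet<>(hash);
        System.out.println(new StopWordHolder(hash).sharesStorageWith(hash)); // true
        System.out.println(new StopWordHolder(tree).sharesStorageWith(tree)); // false
    }
}
```

One consequence worth noting: because the HashSet is reused rather than copied, later modifications by the caller are visible to the filter, so this only makes sense if the set is treated as read-only after construction.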

Doug


---End Message---

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Kevin A. Burton
Doug Cutting wrote:

Erik Hatcher wrote:

Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean and this 
can just be removed by forcing the caller to use a HashSet (which 
they should).


I've caved in and gone HashSet all the way.


Did you not see my message suggesting a way to both not expose HashSet 
publicly and also not to copy values?  If not, I attached it.

For the record I didn't see it... but it echoes my points...

Thanks!

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Erik Hatcher
On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
Erik Hatcher wrote:
Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean and this 
can just be removed by forcing the caller to use a HashSet (which 
they should).
I've caved in and gone HashSet all the way.
Did you not see my message suggesting a way to both not expose HashSet 
publicly and also not to copy values?  If not, I attached it.
Yes, I saw it.  But is there a reason not to just expose HashSet given 
that it is the data structure that is most efficient?  I bought into 
Kevin's arguments that it made sense to just expose HashSet.

As for copying values - that is only happening now if you use the 
Hashtable or String[] constructor.

	Erik


Doug



From: Doug Cutting [EMAIL PROTECTED]
Date: March 10, 2004 1:08:24 PM EST
To: Lucene Developers List [EMAIL PROTECTED]
Subject: Re: cvs commit: 
jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Reply-To: Lucene Developers List [EMAIL PROTECTED]

[EMAIL PROTECTED] wrote:
  -  public StopFilter(TokenStream in, Set stopTable) {
  +  public StopFilter(TokenStream in, Set stopWords) {
   super(in);
  -table = stopTable;
  +this.stopWords = new HashSet(stopWords);
 }
This always allocates a new HashSet, which, if the stop list is large, 
and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
  this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
   : new HashSet(stopWords));
}
and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
  super(in);
  this.stopWords = stopWords;
}
Also, if we want the implementation to always be a HashSet internally, 
for performance, we ought to declare the field to be a HashSet, no?

The competing goals here are:
  1. Not to expose publicly the implementation of the Set;
  2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
  3. Use the most efficient implementation internally.

I think the changes above meet all of these.

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Otis Gospodnetic
I really don't think this will make any noticeable difference, but why
not.  Could you send a diff -uN patch, please?
I made the same changes locally about a year ago, but have since thrown
away my local changes (for no good reason that I recall).

Thanks,
Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
 I'm looking at StopFilter.java right now...
 
 I did a kill -3 java and a number of my threads were blocked here:
 
 ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
 at java.util.Hashtable.get(Hashtable.java:332)
 - waiting to lock 0x61569720 (a java.util.Hashtable)
 at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
 at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
 at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
 at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
 at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)
 
 Is there ANY reason to keep this as a Hashtable?  It's just preventing
 inversion across multiple threads.  They all have to lock on this
 hashtable.
 
 Note that this guy is initialized ONCE and no more puts take place so I
 don't see why not.  It's readonly after the StopFilter is created.
 
 I think this might really end up speeding up indexing a bit.  No hard
 benchmarks yet though.  Right now though it's just an inefficiency that
 should be removed.
 
 I've attached a quick implementation.
 
 Kevin
 
 
  package org.apache.lucene.analysis;
 
 /*
 
  * The Apache Software License, Version 1.1
  *
  * Copyright (c) 2001 The Apache Software Foundation.  All rights
  * reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  *
  * 1. Redistributions of source code must retain the above copyright
  *notice, this list of conditions and the following disclaimer.
  *
  * 2. Redistributions in binary form must reproduce the above
 copyright
  *notice, this list of conditions and the following disclaimer in
  *the documentation and/or other materials provided with the
  *distribution.
  *
  * 3. The end-user documentation included with the redistribution,
  *if any, must include the following acknowledgment:
  *   This product includes software developed by the
  *Apache Software Foundation (http://www.apache.org/).
  *Alternately, this acknowledgment may appear in the software
 itself,
  *if and wherever such third-party acknowledgments normally
 appear.
  *
  * 4. The names Apache and Apache Software Foundation and
  *Apache Lucene must not be used to endorse or promote products
  *derived from this software without prior written permission.
 For
  *written permission, please contact [EMAIL PROTECTED]
  *
  * 5. Products derived from this software may not be called Apache,
  *Apache Lucene, nor may Apache appear in their name, without
  *prior written permission of the Apache Software Foundation.
  *
  * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
  * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
  * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 AND
  * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
  * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
 
  *
  * This software consists of voluntary contributions made by many
  * individuals on behalf of the Apache Software Foundation.  For more
  * information on the Apache Software Foundation, please see
  * http://www.apache.org/.
  */
 
 import java.io.IOException;
 import java.util.*;
 
 /** Removes stop 

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Otis Gospodnetic wrote:

I really don't think this will make any noticeable difference, but why
not.  Could you send a diff -uN patch, please?
I made the same changes locally about a year ago, but have since thrown
away my local changes (for no good reason that I recall).
 

Just diff it locally... it's just a search replace for Hashtable - 
HashMap...

Pretty trivial.

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Erik Hatcher wrote:

I don't see any reason for this to be a Hashtable.

It seems an acceptable alternative to not share analyzer/filter  
instances across threads - they don't really take up much space, so 
is  there a reason to share them?  Or I'm guessing you're sharing it  
implicitly through an IndexWriter, huh?

I'll await further feedback before committing this change, but seems  
reasonable to me.

Yeah... I'm using a RAMDirectory and adding documents to it across 
multiple threads... some of them index at the same time.

The patch is super small... the only difference is that it's using a 
HashMap which isn't synchronized... it can't hurt anything...

but feedback is a good thing :)
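
A minimal sketch of the situation being described: a stop list that is built once and then only read by several indexing threads. SharedStopSetDemo and countStopWords are hypothetical names for illustration, not code from the patch. The assumption that makes the unsynchronized set safe here is that it is fully populated before any worker thread can see it and is never modified afterwards.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: concurrent, lock-free reads of a read-only stop set.
final class SharedStopSetDemo {
    // Built once during class initialization and never modified afterwards,
    // so readers need no locking (unlike Hashtable.get, which synchronizes).
    private static final Set<String> STOP_WORDS =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("a", "an", "the")));

    // Each thread scans the same tokens; returns the total number of stop-word hits.
    static long countStopWords(String[] tokens, int threads) {
        AtomicLong hits = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (String token : tokens) {
                    if (STOP_WORDS.contains(token)) {
                        hits.incrementAndGet();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return hits.get();
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "fox", "a"};
        // Two stop words per pass, four threads.
        System.out.println(countStopWords(tokens, 4)); // 8
    }
}
```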

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Erik Hatcher
Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable signature 
method there.  I suppose we could change the signature to use a Map 
instead, but I believe there are some issues with doing something like 
this if you do not recompile your own source code against a new Lucene 
JAR so I will simply provide another signature too.

	Erik

On Mar 9, 2004, at 4:15 AM, Kevin A. Burton wrote:

Erik Hatcher wrote:

I don't see any reason for this to be a Hashtable.

It seems an acceptable alternative to not share analyzer/filter  
instances across threads - they don't really take up much space, so 
is  there a reason to share them?  Or I'm guessing you're sharing it  
implicitly through an IndexWriter, huh?

I'll await further feedback before committing this change, but seems  
reasonable to me.

Yeah... I'm using a RAMDirectory and adding documents to it across 
multiple threads... some of them index at the same time.

The patch is super small... the only difference is that it's using a 
HashMap which isn't synchronized... it can't hurt anything...

but feedback is a good thing :)

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Doug Cutting
Erik Hatcher wrote:
Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable signature 
method there.  I suppose we could change the signature to use a Map 
instead, but I believe there are some issues with doing something like 
this if you do not recompile your own source code against a new Lucene 
JAR so I will simply provide another signature too.
This is also a problem for folks who're implementing analyzers which use 
StopFilter.  For example:

public class MyAnalyzer extends Analyzer {

  private static Hashtable stopTable =
    StopFilter.makeStopTable(stopWords);

  public TokenStream tokenStream(String field, Reader reader) {
    ... new StopFilter(stopTable) ...
  }
}

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprecate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);
Does that make sense?
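
The four steps above could look roughly like the following. This is only a sketch: StopFilterCompatSketch is a hypothetical stand-in that leaves out TokenStream so it compiles on its own, isStopWord is an invented helper, and it is written with generics, which postdate this thread.

```java
import java.util.HashMap;
import java.util.Hashtable;

// Hypothetical stand-in for StopFilter: TokenStream is omitted so this compiles alone.
final class StopFilterCompatSketch {

  private final HashMap<String, String> map;

  /** Step 1: the old constructor stays, but is deprecated. */
  @Deprecated
  StopFilterCompatSketch(Hashtable<String, String> stopTable) {
    // Copy into an unsynchronized map so lookups no longer take a lock.
    this(new HashMap<String, String>(stopTable));
  }

  /** Step 3: the new constructor. */
  StopFilterCompatSketch(HashMap<String, String> stopMap) {
    this.map = stopMap;
  }

  /** Step 2: the old factory method stays, but is deprecated. */
  @Deprecated
  static Hashtable<String, String> makeStopTable(String[] stopWords) {
    return new Hashtable<String, String>(makeStopMap(stopWords));
  }

  /** Step 4: the new factory method; each word is stored as both key and value. */
  static HashMap<String, String> makeStopMap(String[] stopWords) {
    HashMap<String, String> stopMap = new HashMap<String, String>(stopWords.length);
    for (String word : stopWords) {
      stopMap.put(word, word);
    }
    return stopMap;
  }

  /** Invented helper standing in for the lookup that next() performs. */
  boolean isStopWord(String term) {
    return map.containsKey(term);
  }
}
```

Code written against the Hashtable signatures would keep compiling (with a deprecation warning), while new code could call makeStopMap directly.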

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread David Spencer
Maybe I missed something but I always thought the stop list should be a 
Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
know is existence and that's what a Set does.
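
As a rough illustration of that point: a stop list only has to answer "is this word present?", which is what Set.contains does. The class and method names below are hypothetical and do not come from Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopSetSketch {

  /** Builds the stop list as a Set: there are no values to store, only membership. */
  static Set<String> makeStopSet(String... stopWords) {
    return new HashSet<String>(Arrays.asList(stopWords));
  }

  /** Returns the tokens that are not in the stop set, in their original order. */
  static List<String> removeStopWords(List<String> tokens, Set<String> stopSet) {
    List<String> kept = new ArrayList<String>();
    for (String token : tokens) {
      if (!stopSet.contains(token)) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Set<String> stopSet = makeStopSet("the", "a", "of");
    System.out.println(removeStopWords(Arrays.asList("the", "index", "of", "terms"), stopSet));
  }
}
```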

Doug Cutting wrote:

Erik Hatcher wrote:

Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable signature 
method there.  I suppose we could change the signature to use a Map 
instead, but I believe there are some issues with doing something like 
this if you do not recompile your own source code against a new Lucene 
JAR so I will simply provide another signature too.


This is also a problem for folks who're implementing analyzers which use 
StopFilter.  For example:

public class MyAnalyzer extends Analyzer {

  private static Hashtable stopTable =
    StopFilter.makeStopTable(stopWords);

  public TokenStream tokenStream(String field, Reader reader) {
    ... new StopFilter(stopTable) ...
  }
}

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprecate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);
Does that make sense?

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Doug Cutting
David Spencer wrote:
Maybe I missed something but I always thought the stop list should be a 
Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
know is existence and that's what a Set does.
Good point.

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Doug Cutting wrote:

Erik Hatcher wrote:

Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable 
signature method there.  I suppose we could change the signature to 
use a Map instead, but I believe there are some issues with doing 
something like this if you do not recompile your own source code 
against a new Lucene JAR so I will simply provide another 
signature too.


This is also a problem for folks who're implementing analyzers which 
use StopFilter.  For example:

public class MyAnalyzer extends Analyzer {

  private static Hashtable stopTable =
    StopFilter.makeStopTable(stopWords);

  public TokenStream tokenStream(String field, Reader reader) {
    ... new StopFilter(stopTable) ...
  }
}

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprecate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);
Does that make sense?
Ah... ok... good point.  If no one does this I'll take care of it...

Kevin


Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
David Spencer wrote:

Maybe I missed something but I always thought the stop list should be 
a Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
know is existence and that's what a Set does.
It stores the word as the key and the value...

I don't care either way... There was no HashSet back when this was 
written. I was just going to leave it as a HashMap so that in the future 
if we ever wanted to change the value we could...

Either way.


Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Doug Cutting wrote:

Erik Hatcher wrote:

Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable 
signature method there.  I suppose we could change the signature to 
use a Map instead, but I believe there are some issues with doing 
something like this if you do not recompile your own source code 
against a new Lucene JAR so I will simply provide another 
signature too.


This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprecate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);
Does that make sense?

This patch and attachment take care of this problem... 

It does make this class more complex than it needs to be... but 1/2 of 
the methods are deprecated.

Kevin


package org.apache.lucene.analysis;

/* 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in
 *the documentation and/or other materials provided with the
 *distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *if any, must include the following acknowledgment:
 *   This product includes software developed by the
 *Apache Software Foundation (http://www.apache.org/).
 *Alternately, this acknowledgment may appear in the software itself,
 *if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names Apache and Apache Software Foundation and
 *Apache Lucene must not be used to endorse or promote products
 *derived from this software without prior written permission. For
 *written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called Apache,
 *Apache Lucene, nor may Apache appear in their name, without
 *prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * 
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * http://www.apache.org/.
 */

import java.io.IOException;
import java.util.*;

/** Removes stop words from a token stream. */

public final class StopFilter extends TokenFilter {

  //Note: this could migrate to using a HashSet
  private HashMap map;

  /** Constructs a filter which removes words from the input
      TokenStream that are named in the array of words. */
  public StopFilter(TokenStream in, String[] stopWords) {
    super(in);
    map = makeStopMap(stopWords);
  }

  /** Constructs a filter which removes words from the input
      TokenStream that are named in the HashMap. */
  public StopFilter(TokenStream in, HashMap stopMap) {
    super(in);
    map = stopMap;
  }

  /**
   * @deprecated Use HashMap instead.
   */
  public StopFilter(TokenStream in, Hashtable stopTable) {
    super(in);
    map = new HashMap();

    Enumeration keys = stopTable.keys();
    while ( keys.hasMoreElements() ) {
      Object key = keys.nextElement();
      map.put( key, stopTable.get( key ) );
 

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Doug Cutting wrote:

David Spencer wrote:

Maybe I missed something but I always thought the stop list should be 
a Set, not a Map (or Hashtable/Dictionary). After all, all you need 
to know is existence and that's what a Set does.


Good point.
It's easy to migrate to a HashSet... either way...   I was thinking 
about the same thing myself...

Kevin


Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Erik Hatcher
Kevin - I've made this change and committed it, using a Set.

Let me know if there are any issues with what I've committed - I 
believe I've faithfully preserved backwards compatibility.

	Erik

p.s. ...

On Mar 9, 2004, at 2:00 PM, Kevin A. Burton wrote:
  public StopFilter(TokenStream in, Hashtable stopTable) {
    super(in);
    map = new HashMap();
    Enumeration keys = stopTable.keys();
    while ( keys.hasMoreElements() ) {
      Object key = keys.nextElement();
      map.put( key, stopTable.get( key ) );
    }
By the way, the ctor to HashMap can take a Map, which Hashtable is also 
:))
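
Concretely, java.util.Hashtable implements java.util.Map, and HashMap has a constructor that accepts any Map, so the copy loop quoted above can collapse to a single call. A minimal sketch, with a hypothetical class and method name:

```java
import java.util.HashMap;
import java.util.Hashtable;

final class HashtableCopySketch {

  /** Copies every entry of the Hashtable into a new, unsynchronized HashMap. */
  static HashMap<String, String> toHashMap(Hashtable<String, String> stopTable) {
    return new HashMap<String, String>(stopTable);
  }

  public static void main(String[] args) {
    Hashtable<String, String> stopTable = new Hashtable<String, String>();
    stopTable.put("the", "the");
    stopTable.put("a", "a");
    System.out.println(toHashMap(stopTable).containsKey("the"));
  }
}
```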



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Incze Lajos

 This would no longer compile with the change Kevin proposes.
 
 To make things back-compatible we must:
 
 1. Keep but deprecate StopFilter(Hashtable) constructor;
 2. Keep but deprecate StopFilter.makeStopTable(String[]);
 3. Add a new constructor: StopFilter(HashMap);

If you used StopFilter(Map), it would be backward compatible for
users passing a Hashtable to the constructor. I'm not sure about
older Java versions, but in Java 1.4 Hashtable implements Map.
(And, on the other hand, why HashMap and not Map?)

 4. Add a new method: StopFilter.makeStopMap(String[]);
 
 Does that make sense?
 
 Doug


incze
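
A small sketch of the suggestion above, using a hypothetical helper rather than the real StopFilter: a parameter typed as Map accepts a Hashtable and a HashMap alike at the source level. One caveat, which seems to be what Erik alluded to earlier in the thread: callers already compiled against the (Hashtable) signature are linked to that exact signature, so they would still need recompiling if it were replaced rather than kept alongside the new one.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

final class MapParameterSketch {

  /** Counts tokens that appear as keys in the stop map; any Map implementation works. */
  static int countStopWords(Map<String, String> stopMap, String... tokens) {
    int count = 0;
    for (String token : tokens) {
      if (stopMap.containsKey(token)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    Hashtable<String, String> oldStyle = new Hashtable<String, String>();
    oldStyle.put("the", "the");
    HashMap<String, String> newStyle = new HashMap<String, String>(oldStyle);

    // Both calls compile because both classes implement Map.
    System.out.println(countStopWords(oldStyle, "the", "index"));
    System.out.println(countStopWords(newStyle, "the", "index"));
  }
}
```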




Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Erik Hatcher wrote:

Kevin - I've made this change and committed it, using a Set.

Let me know if there are any issues with what I've committed - I 
believe I've faithfully preserved backwards compatibility.

Great... I'll take a look!

p.s. ...

On Mar 9, 2004, at 2:00 PM, Kevin A. Burton wrote:

  public StopFilter(TokenStream in, Hashtable stopTable) {
    super(in);
    map = new HashMap();
    Enumeration keys = stopTable.keys();
    while ( keys.hasMoreElements() ) {
      Object key = keys.nextElement();
      map.put( key, stopTable.get( key ) );
    }


By the way, the ctor to HashMap can take a Map, which Hashtable is 
also :))

Crap... good point.. Actually that was the FIRST thing I checked but my 
javadoc index wasn't up to date... long story.  Actually I was pissed to 
find out that it didn't implement a map interface...  :)


Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Erik Hatcher wrote:

Kevin - I've made this change and committed it, using a Set.

Let me know if there are any issues with what I've committed - I 
believe I've faithfully preserved backwards compatibility.
Actually... Erik.. I don't think your Hashtable constructor will work...

By default, Hashtable.keySet() returns a SynchronizedSet (on JDK 1.4.2),
so we're back where we started:

 public StopFilter(TokenStream in, Hashtable stopTable) {
   super(in);
   table = stopTable.keySet();
 }
 

You need to make it a HashSet:

  table = new HashSet( stopTable.keySet() );
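
To illustrate the difference, here is a sketch with a hypothetical class name. Hashtable.keySet() returns a view backed by the table, so it keeps tracking the table (and, per the observation above, stays synchronized), whereas copying it into a HashSet produces an independent, unsynchronized snapshot.

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

final class KeySetCopySketch {

  /** Takes a detached, unsynchronized snapshot of the table's keys. */
  static Set<String> snapshotKeys(Hashtable<String, String> stopTable) {
    return new HashSet<String>(stopTable.keySet());
  }

  public static void main(String[] args) {
    Hashtable<String, String> stopTable = new Hashtable<String, String>();
    stopTable.put("the", "the");

    Set<String> view = stopTable.keySet();       // live view backed by the Hashtable
    Set<String> copy = snapshotKeys(stopTable);  // independent HashSet

    stopTable.put("of", "of");                   // a later change to the table

    System.out.println("view sees 'of': " + view.contains("of"));
    System.out.println("copy sees 'of': " + copy.contains("of"));
  }
}
```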

Also... while you're at it... the private variable name is 'table' which 
this HashSet certainly is *not* ;)

Probably makes sense to just call this variable 'hashset' and then force 
the type to be HashSet since it's necessary for this to be a HashSet to 
maintain any decent performance.  You'll need to update your second 
constructor to require a HashSet too.. would be very bad to let callers 
use another set impl... TreeSet and SortedSet would still be too slow...

Anyway... I had this feature in my patch ;)

Thanks!

Kevin


DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-08 Thread Kevin A. Burton
I'm looking at StopFilter.java right now...

I did a kill -3 java and a number of my threads were blocked here:

ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
   at java.util.Hashtable.get(Hashtable.java:332)
   - waiting to lock 0x61569720 (a java.util.Hashtable)
   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
   at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
   at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
   at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
   at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable?  It's just preventing 
inversion across multiple threads.  They all have to lock on this hashtable.

Note that this guy is initialized ONCE and no more puts take place so I 
don't see why not.  It's readonly after the StopFilter is created.

I think this might really end up speeding up indexing a bit.  No hard 
benchmarks yet though.  Right now though it's just an inefficiency that 
should be removed.
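
The reasoning above is that a structure which is filled once and only read afterwards does not need a lock on every lookup. A minimal sketch of that idea follows. All names are hypothetical, it uses java.util.concurrent (which was not yet part of the JDK when this was posted), and it relies on the set being fully built before any worker thread reads it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedStopSetSketch {

  // Built once during class initialization and never modified afterwards.
  private static final Set<String> STOP_WORDS =
      Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("a", "the", "of")));

  /** Counts the tokens that survive stop-word removal; the lookup takes no lock. */
  static int countKept(String[] tokens) {
    int kept = 0;
    for (String token : tokens) {
      if (!STOP_WORDS.contains(token)) {
        kept++;
      }
    }
    return kept;
  }

  /** Runs countKept on several threads at once and sums the per-thread results. */
  static int countKeptInParallel(int threads, final String[] tokens) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<Future<Integer>>();
      for (int i = 0; i < threads; i++) {
        Callable<Integer> task = () -> countKept(tokens);
        results.add(pool.submit(task));
      }
      int total = 0;
      for (Future<Integer> result : results) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("worker failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

With a Hashtable in place of the HashSet, every contains or get call from those worker threads would contend for the same monitor, which is the blocking visible in the thread dump.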

I've attached a quick implementation. 

Kevin


package org.apache.lucene.analysis;


import java.io.IOException;
import java.util.*;

/** Removes stop words from a token stream. */

public final class StopFilter extends TokenFilter {

  //Note: this could migrate to using a HashSet
  private HashMap table;

  /** Constructs a filter which removes words from the input
      TokenStream that are named in the array of words. */
  public StopFilter(TokenStream in, String[] stopWords) {
    super(in);
    table = makeStopTable(stopWords);
  }

  /** Constructs a filter which removes words from the input

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-08 Thread Erik Hatcher
I don't see any reason for this to be a Hashtable.

It seems an acceptable alternative to not share analyzer/filter  
instances across threads - they don't really take up much space, so is  
there a reason to share them?  Or I'm guessing you're sharing it  
implicitly through an IndexWriter, huh?

I'll away further feedback before committing this change, but seems  
reasonable to me.

	Erik

On Mar 8, 2004, at 8:50 PM, Kevin A. Burton wrote:
I'm looking at StopFilter.java right now...

I did a kill -3 java and a number of my threads were blocked here:

ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
   at java.util.Hashtable.get(Hashtable.java:332)
   - waiting to lock 0x61569720 (a java.util.Hashtable)
   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
   at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
   at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
   at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
   at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable?  It's just preventing  
inversion across multiple threads.  They all have to lock on this  
hashtable.

Note that this guy is initialized ONCE and no more puts take place so  
I don't see why not.  It's readonly after the StopFilter is created.

I think this might really end up speeding up indexing a bit.  No hard  
benchmarks yet though.  Right now though it's just an inefficiency  
that should be removed.

I've attached a quick implementation.
Kevin
package org.apache.lucene.analysis;


import java.io.IOException;
import java.util.*;
/** Removes