Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Erik Hatcher
On Mar 9, 2004, at 10:23 PM, Kevin A. Burton wrote:
> You need to make it a HashSet:
>
>   table = new HashSet( stopTable.keySet() );

Done.

> Also... while you're at it... the private variable name is 'table'
> which this HashSet certainly is *not* ;)

Well, depends on your definition of 'table' I suppose :)  I changed it
to a type-agnostic stopWords.

> Probably makes sense to just call this variable 'hashset' and then
> force the type to be HashSet since it's necessary for this to be a
> HashSet to maintain any decent performance.  You'll need to update
> your second constructor to require a HashSet too.. would be very bad
> to let callers use another set impl... TreeSet and SortedSet would
> still be too slow...

I refuse to expose HashSet... sorry!  :)  But I did wrap what is passed
in, like above, in a HashSet in my latest commit.
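
For reference, a minimal sketch of what that wrapping looks like
(assuming the Lucene 1.3-era analysis API; illustrative only, not the
committed code):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class StopFilter extends TokenFilter {

  // Held as a HashSet internally for O(1) contains(); callers only
  // ever see the Set interface.
  private final HashSet stopWords;

  public StopFilter(TokenStream in, Set stopWords) {
    super(in);
    // Wrap whatever Set implementation the caller passes in a HashSet.
    this.stopWords = new HashSet(stopWords);
  }

  // Pass through every token whose text is not in the stop set.
  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      if (!stopWords.contains(t.termText()))
        return t;
    }
    return null;
  }
}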

	Erik



Re: Storing numbers

2004-03-10 Thread lucene
On Tuesday 09 March 2004 20:51, Timothy Stone wrote:
> Michael Giles wrote:
> > Tim,
> >
> > Looks like you can only access it with a subscription.  :(  Sounds
> > good, though.
>
> Really? I don't have a subscription. Got to it via the archives,
> actually, now that I think about it:
>
> Try Volume 7, Issue 12.

I also need a subscription for:
http://www.sys-con.com/story/search.cfm?pub=1&ss=lucene




Large document collections?

2004-03-10 Thread Mark Devaney
I'm looking for information on the largest document collection that
Lucene has been used to index; the biggest benchmark I've been able to
find so far is 1MM documents.

I'd like to generate some benchmarks for large collections (1-100MM
records) and would like to know if this is feasible without using
distributed indexes, etc.  It's mostly to construct a performance
profile relating indexing/retrieval time and storage requirements to
the number of documents.

Thanks.





Re: Large document collections?

2004-03-10 Thread Otis Gospodnetic
I think even a 100K or 1MM doc collection will give you an idea of the
retrieval time/storage requirements (which, of course, are highly
dependent on what you index and how you index it).  I know several
people have created collections with up to 50MM docs on a single
machine (not sure about the number of CPUs, etc.).

Otis


--- Mark Devaney [EMAIL PROTECTED] wrote:
> I'm looking for information on the largest document collection that
> Lucene has been used to index; the biggest benchmark I've been able
> to find so far is 1MM documents.
>
> I'd like to generate some benchmarks for large collections (1-100MM
> records) and would like to know if this is feasible without using
> distributed indexes, etc.  It's mostly to construct a performance
> profile relating indexing/retrieval time and storage requirements to
> the number of documents.
>
> Thanks.





RE: Storing numbers

2004-03-10 Thread Olga Dadasheva

Try this link and scroll to the top:
http://www.sys-con.com/story/?storyid=37296&DE=1#RES

Thank you, Tim - excellent article.



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 10, 2004 10:23 AM
To: Lucene Users List
Subject: Re: Storing numbers


On Tuesday 09 March 2004 20:51, Timothy Stone wrote:
> Michael Giles wrote:
> > Tim,
> >
> > Looks like you can only access it with a subscription.  :(  Sounds
> > good, though.
>
> Really? I don't have a subscription. Got to it via the archives,
> actually, now that I think about it:
>
> Try Volume 7, Issue 12.

I also need a subscription for:
http://www.sys-con.com/story/search.cfm?pub=1&ss=lucene







Re: Large document collections?

2004-03-10 Thread Paladin
I use several collections: one of 1,200,000 documents, one of
3,800,000, and another of 12,000,000 documents (the biggest ones), and
performance is quite good (except for searches with wildcards).
Our machine has 1 GB of memory and 2 CPUs.


- Original Message - 
From: Mark Devaney [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, March 10, 2004 4:26 PM
Subject: Large document collections?


> I'm looking for information on the largest document collection that
> Lucene has been used to index; the biggest benchmark I've been able
> to find so far is 1MM documents.
>
> I'd like to generate some benchmarks for large collections (1-100MM
> records) and would like to know if this is feasible without using
> distributed indexes, etc.  It's mostly to construct a performance
> profile relating indexing/retrieval time and storage requirements to
> the number of documents.
>
> Thanks.








Re: Large document collections?

2004-03-10 Thread Paladin
Well, usually the response time is 5-10 seconds max; it depends on the
query (except for queries with a wildcard).
I put a timeout of 30 seconds on all queries.
Queries with a wildcard can fail with a java.lang.OutOfMemoryError.
You can try it yourself on my company's website (though the site seems
to be down at the moment).
If you want the address, send me a mail off-list and I'll explain in
detail how to run your own tests and how our software works.

- Original Message - 
From: Albert Vila Puig [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 10, 2004 5:36 PM
Subject: Re: Large document collections?


> Can you please provide some queries and their performance?
>
> Thanks
>
> Paladin wrote:
>
> > I use several collections: one of 1,200,000 documents, one of
> > 3,800,000, and another of 12,000,000 documents (the biggest ones),
> > and performance is quite good (except for searches with wildcards).
> > Our machine has 1 GB of memory and 2 CPUs.





Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Kevin A. Burton
Erik Hatcher wrote:


> > Also... while you're at it... the private variable name is 'table'
> > which this HashSet certainly is *not* ;)
>
> Well, depends on your definition of 'table' I suppose :)  I changed it
> to a type-agnostic stopWords.

Did you know that internally HashSet uses a HashMap?

I sure didn't!

hashset.contains() maps to hashmap.containsKey()

It maps every key to a single shared PRESENT object... hm.
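
For the curious, the delegation looks roughly like this (a trimmed
paraphrase of the J2SE 1.4 source; the real class implements the full
Set interface):

import java.util.HashMap;

// Paraphrase of java.util.HashSet's internals: a set is just the key
// view of a backing HashMap.
class SimpleHashSet {

  private final HashMap map = new HashMap();

  // A single shared dummy value marks every key as "present".
  private static final Object PRESENT = new Object();

  boolean add(Object o) {
    return map.put(o, PRESENT) == null;  // null means the key was new
  }

  boolean contains(Object o) {
    return map.containsKey(o);           // set membership is a key lookup
  }
}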

> > Probably makes sense to just call this variable 'hashset' and then
> > force the type to be HashSet since it's necessary for this to be a
> > HashSet to maintain any decent performance.  You'll need to update
> > your second constructor to require a HashSet too.. would be very bad
> > to let callers use another set impl... TreeSet and SortedSet would
> > still be too slow...
>
> I refuse to expose HashSet... sorry!  :)  But I did wrap what is
> passed in, like above, in a HashSet in my latest commit.

Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Why do you have a problem exposing a HashSet/Map... it SHOULD be a hash
based implementation.  Doing anything else is just wrong and would
seriously slow down Lucene indexing.

Also... your HashSet constructor has to copy values from the original
HashSet into the new HashSet ... not very clean, and this can just be
removed by forcing the caller to use a HashSet (which they should).

:)

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Erik Hatcher
On Mar 10, 2004, at 2:59 PM, Kevin A. Burton wrote:
> > I refuse to expose HashSet... sorry!  :)  But I did wrap what is
> > passed in, like above, in a HashSet in my latest commit.
>
> Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Well, it was in the ctor.  But I guess the filter isn't constructed
often enough for this to cause a performance hit.

> Why do you have a problem exposing a HashSet/Map... it SHOULD be a
> hash based implementation.  Doing anything else is just wrong and
> would seriously slow down Lucene indexing.

Just semantically, it is a set of stop words - so in theory the actual
implementation shouldn't matter.  I'm an interface purist at heart.

> Also... your HashSet constructor has to copy values from the original
> HashSet into the new HashSet ... not very clean, and this can just be
> removed by forcing the caller to use a HashSet (which they should).

I've caved in and gone HashSet all the way.

	Erik



1.3-final builds as 1.4-rc1-dev?

2004-03-10 Thread Jeff Wong
Hello,

I noticed that Lucene 1.3-final source builds a JAR file whose version
number is 1.4-rc1-dev.  What does this mean?  Will 1.4-final build as
1.5-rc1-dev?

Just Curious,

Jeff





Re: 1.3-final builds as 1.4-rc1-dev?

2004-03-10 Thread Erik Hatcher
My guess is that we screwed up the timing somehow and changed the
version in the build file after we built the binary release.

We'll be more careful with the 1.4 release and make sure this doesn't 
happen then.

	Erik

On Mar 10, 2004, at 8:34 PM, Jeff Wong wrote:

> Hello,
>
> I noticed that Lucene 1.3-final source builds a JAR file whose version
> number is 1.4-rc1-dev.  What does this mean?  Will 1.4-final build as
> 1.5-rc1-dev?
>
> Just Curious,
>
> Jeff



Re: 1.3-final builds as 1.4-rc1-dev?

2004-03-10 Thread Doug Cutting
Jeff Wong wrote:
> I noticed that Lucene 1.3-final source builds a JAR file whose version
> number is 1.4-rc1-dev.  What does this mean?  Will 1.4-final build as
> 1.5-rc1-dev?
Probably.  If you modify the sources of a 1.3-final release, and build 
them, you're not building 1.3-final, but a derivative.  We could call it 
1.3-dev or something, but that would be strange, as 1.3 development is 
closed.  All development is now towards 1.4-based releases.  As a 
side-effect, even if you make no changes to the 1.3-final sources and 
build them, it builds as 1.4-rc1-dev.  I think that is still safer than 
calling it 1.3-final, since 1.3-final should be reserved for the exact 
jar file downloaded from Apache.  In general, anything ending with -dev 
doesn't have any guarantees, and the version before that is only meant 
to be suggestive.

Doug



Re: 1.3-final builds as 1.4-rc1-dev?

2004-03-10 Thread Erik Hatcher
On Mar 10, 2004, at 9:45 PM, Doug Cutting wrote:
> Jeff Wong wrote:
> > I noticed that Lucene 1.3-final source builds a JAR file whose
> > version number is 1.4-rc1-dev.  What does this mean?  Will
> > 1.4-final build as 1.5-rc1-dev?
>
> Probably.  If you modify the sources of a 1.3-final release, and
> build them, you're not building 1.3-final, but a derivative.  We
> could call it 1.3-dev or something, but that would be strange, as 1.3
> development is closed.  All development is now towards 1.4-based
> releases.  As a side-effect, even if you make no changes to the
> 1.3-final sources and build them, it builds as 1.4-rc1-dev.  I think
> that is still safer than calling it 1.3-final, since 1.3-final should
> be reserved for the exact jar file downloaded from Apache.  In
> general, anything ending with -dev doesn't have any guarantees, and
> the version before that is only meant to be suggestive.

Ah... this seems perfectly reasonable!  And I concur: if it's not the
exact JAR, then it shouldn't have the final stamp of approval.

	Erik



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Doug Cutting
Erik Hatcher wrote:
> > Also... your HashSet constructor has to copy values from the
> > original HashSet into the new HashSet ... not very clean, and this
> > can just be removed by forcing the caller to use a HashSet (which
> > they should).
>
> I've caved in and gone HashSet all the way.

Did you not see my message suggesting a way to both not expose HashSet
publicly and also not copy values?  If not, it's attached below.

Doug


---BeginMessage---
[EMAIL PROTECTED] wrote:
> -  public StopFilter(TokenStream in, Set stopTable) {
> +  public StopFilter(TokenStream in, Set stopWords) {
>      super(in);
> -    table = stopTable;
> +    this.stopWords = new HashSet(stopWords);
>    }
This always allocates a new HashSet, which, if the stop list is large, 
and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
  this(in, stopWords instanceof HashSet
           ? (HashSet) stopWords
           : new HashSet(stopWords));
}

and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
  super(in);
  this.stopWords = stopWords;
}
Also, if we want the implementation to always be a HashSet internally, 
for performance, we ought to declare the field to be a HashSet, no?

The competing goals here are:
  1. Not to expose publicly the implementation of the Set;
  2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
  3. Use the most efficient implementation internally.

I think the changes above meet all of these.
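
A hypothetical usage sketch (assuming the makeStopSet helper mentioned
above takes a String[] and returns a HashSet, so the instanceof branch
above skips the copy):

import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class StopFilterUsage {
  public static void main(String[] args) throws Exception {
    // Build the stop set once and reuse it; since it is already a
    // HashSet, constructing each StopFilter copies nothing.
    Set stopWords = StopFilter.makeStopSet(new String[] {"a", "an", "the"});
    TokenStream ts = new StopFilter(
        new LowerCaseTokenizer(new StringReader("a grape is not an apple")),
        stopWords);
    Token t;
    while ((t = ts.next()) != null) {
      System.out.println(t.termText());  // prints: grape, is, not, apple
    }
  }
}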

Doug


---End Message---

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Kevin A. Burton
Doug Cutting wrote:

> Erik Hatcher wrote:
> > > Also... your HashSet constructor has to copy values from the
> > > original HashSet into the new HashSet ... not very clean, and
> > > this can just be removed by forcing the caller to use a HashSet
> > > (which they should).
> >
> > I've caved in and gone HashSet all the way.
>
> Did you not see my message suggesting a way to both not expose
> HashSet publicly and also not copy values?  If not, it's attached
> below.

For the record, I didn't see it... but it echoes my points...

Thanks!

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Erik Hatcher
On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
> > > Also... your HashSet constructor has to copy values from the
> > > original HashSet into the new HashSet ... not very clean, and
> > > this can just be removed by forcing the caller to use a HashSet
> > > (which they should).
> >
> > I've caved in and gone HashSet all the way.
>
> Did you not see my message suggesting a way to both not expose
> HashSet publicly and also not copy values?  If not, it's attached
> below.

Yes, I saw it.  But is there a reason not to just expose HashSet, given
that it is the most efficient data structure here?  I bought into
Kevin's argument that it made sense to just expose HashSet.

As for copying values - that now happens only with the Hashtable or
String[] constructor.
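
For reference, a plausible shape for those two constructors (only
their existence is confirmed above; the bodies and signatures here are
assumed):

// (Inside the StopFilter class; java.util.Arrays, java.util.HashSet,
// and java.util.Hashtable imports assumed.)
public StopFilter(TokenStream in, String[] stopWords) {
  super(in);
  // Copy the array into the internal HashSet once, at construction.
  this.stopWords = new HashSet(Arrays.asList(stopWords));
}

public StopFilter(TokenStream in, Hashtable stopTable) {
  super(in);
  // Only the keys matter; the Hashtable's values are ignored.
  this.stopWords = new HashSet(stopTable.keySet());
}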

	Erik


> Doug



From: Doug Cutting [EMAIL PROTECTED]
Date: March 10, 2004 1:08:24 PM EST
To: Lucene Developers List [EMAIL PROTECTED]
Subject: Re: cvs commit: 
jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Reply-To: Lucene Developers List [EMAIL PROTECTED]

[EMAIL PROTECTED] wrote:
> -  public StopFilter(TokenStream in, Set stopTable) {
> +  public StopFilter(TokenStream in, Set stopWords) {
>      super(in);
> -    table = stopTable;
> +    this.stopWords = new HashSet(stopWords);
>    }
This always allocates a new HashSet, which, if the stop list is large, 
and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
  this(in, stopWords instanceof HashSet
           ? (HashSet) stopWords
           : new HashSet(stopWords));
}

and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
  super(in);
  this.stopWords = stopWords;
}
Also, if we want the implementation to always be a HashSet internally, 
for performance, we ought to declare the field to be a HashSet, no?

The competing goals here are:
  1. Not to expose publicly the implementation of the Set;
  2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
  3. Use the most efficient implementation internally.

I think the changes above meet all of these.

Doug



incomplete word match

2004-03-10 Thread Tomcat Programmer
I have a situation where I need to be able to find incomplete word
matches; for example, a search for the string 'ape' would return
matches for 'grapes', 'naples', 'staples', etc.  I have been searching
the archives of this user list and can't seem to find any example of
someone doing this.
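
One way to get this kind of match is a programmatic WildcardQuery with
a leading wildcard - a sketch only, with an assumed 'contents' field
and a placeholder index path; note that a leading wildcard forces
Lucene to enumerate every term in the index, so it can be very slow on
large indexes:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class SubstringSearch {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    // "*ape*" matches 'grapes', 'naples', 'staples', etc.
    Query query = new WildcardQuery(new Term("contents", "*ape*"));
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " matching documents");
    searcher.close();
  }
}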

At one point I recall finding someone's site (via Google) which
indicated that its search engine was Lucene, and it offered this type
of matching.  However, I can't seem to find that site again to save my
life!

Has anyone been successful in implementing this type
of matching with Lucene? If so, would you be able to
share some insight as to how you did it? 

Thanks in advance! 

-TP
