Re: int vs long and document ids on 64bit machines.

2004-03-11 Thread Doug Cutting
hui wrote:
> If the document id is going to be changed, is it possible to define an
> interface so the user could provide another implementation to replace the
> default one? For example, the document's unique timestamp or any other field
> could be used, as long as it is a long.

I don't think that would be a good idea.  Lucene's index format requires 
document ids to increase as documents are added, and things are *much* 
more efficient when the numbering is dense.
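
To illustrate the density point, here is a minimal sketch (not Lucene's actual code). It assumes postings store each document id as the gap from the previous id in a variable-length byte encoding, which is one reason ids need to increase and why small gaps are cheap. The names DocIdGaps, vIntSize and encodedSize, and the sample timestamps, are hypothetical.

```java
// Hypothetical sketch: storage cost of gap-encoding increasing document ids.
class DocIdGaps {
    // Bytes needed to store v using 7 payload bits per byte.
    static int vIntSize(long v) {
        int size = 1;
        while ((v & ~0x7FL) != 0) {
            size++;
            v >>>= 7;
        }
        return size;
    }

    // Total bytes to store non-decreasing ids as gaps from the previous id.
    static int encodedSize(long[] increasingIds) {
        int total = 0;
        long previous = 0;
        for (long id : increasingIds) {
            if (id < previous) {
                throw new IllegalArgumentException("ids must not decrease");
            }
            total += vIntSize(id - previous);
            previous = id;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] dense = {0, 1, 2, 3};
        // Assumed example: millisecond timestamps one minute apart used as ids.
        long[] timestamps = {1078992000000L, 1078992060000L, 1078992120000L, 1078992180000L};
        System.out.println("dense ids:     " + encodedSize(dense) + " bytes");
        System.out.println("timestamp ids: " + encodedSize(timestamps) + " bytes");
    }
}
```

Under these assumptions, sparse ids such as timestamps take several bytes per gap, while dense ids take one byte per gap, and any structure indexed directly by id would also be mostly empty.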

> In this way, we could easily get another
> sorting option besides the default score sorting. Still not sure whether
> it is a good idea or not, though.

There is good result sorting support in the latest CVS that will be in 1.4.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: int vs long and document ids on 64bit machines.

2004-03-11 Thread hui
If the document id is going to be changed, is it possible to define an
interface so the user could provide another implementation to replace the
default one? For example, the document's unique timestamp or any other field
could be used, as long as it is a long. In this way, we could easily get another
sorting option besides the default score sorting. Still not sure whether
it is a good idea or not, though.

Regards,
hui

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 11, 2004 2:14 PM
To: Lucene Users List
Subject: Re: int vs long and document ids on 64bit machines.

Doug Cutting wrote:

>
> Someone, not me, perhaps provided that rationalization, which isn't a
> bad one.  In fact, the situation was more that, in 1997, when I 
> started Lucene, 2 billion documents seemed like a lot for a Java-based 
> search engine which was designed to scale to perhaps millions of 
> documents, but probably not to the world.  Java was slow then, remember?

Yes... agreed.

>> Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
>> How hard would it be to build a lucene64 that used 64bit document 
>> handles (longs) for 64bit processors?!  Is it just a recompile?  Will
>> the file format break and need updating?!
>
>
> I think the file format is 64-bit safe.  But the code changes would be 
> quite numerous.  No doubt we should make this change someday.  Do you 
> anticipate more than 2 billion documents in your Lucene index sometime 
> soon, e.g., this year?
>
> Also, with Java, it's not just a recompile, it's a lot of code changes.

Well ... the refactor should at LEAST be pretty easy... just start
changing int->long and follow up until the code compiles.  Not sure if 
it's that easy.

>> Also ... what are the symptoms of a Lucene build using 64bit ints on 
>> 32bit processors.  Right now we're personally stuck on 32bit machines 
>> but I would like to see us migrate to 64 bit boxes over the next 6 
>> months...
>
>
> Java's int datatype is defined as 32 bit.  So there are no 64-bit 
> ints.  There are longs.  I doubt longs are much slower than ints to 
> deal with on most JVMs today.  However a long[] is twice as big as an 
> int[], and an array may only be indexed by an int.  Currently Lucene 
> uses a byte[] indexed by document number to store normalization 
> factors.  This would not work if document numbers are longs.  Filters 
> index bit vectors with document numbers, and that also would not work 
> if document numbers were longs.  Working around these will not only 
> take some code, it may also impact performance a bit.
>
> I suspect that Java will soon evolve to better embrace 64-bit 
> machines.  Someday assignment of longs will be atomic.  (This is 
> hinted at in the language spec.)  Someday arrays will probably be 
> indexable by longs. I'd prefer to wait until these changes happen 
> before changing Lucene's document numbers to longs.
>
At some point I might take a look at the code and see how hard it would
be... Thanks for your notes... I'll probably use these in the future.

The main problem is that with indexes that have lots of SMALL documents you
could see yourself running out of ints.

Kevin

-- 

Please reply using PGP.

http://peerfear.org/pubkey.asc

NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster






Re: int vs long and document ids on 64bit machines.

2004-03-11 Thread Kevin A. Burton
Doug Cutting wrote:

> Someone, not me, perhaps provided that rationalization, which isn't a
> bad one.  In fact, the situation was more that, in 1997, when I
> started Lucene, 2 billion documents seemed like a lot for a Java-based
> search engine which was designed to scale to perhaps millions of
> documents, but probably not to the world.  Java was slow then, remember?

Yes... agreed.

>> Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
>> How hard would it be to build a lucene64 that used 64bit document
>> handles (longs) for 64bit processors?!  Is it just a recompile?  Will
>> the file format break and need updating?!
>
> I think the file format is 64-bit safe.  But the code changes would be
> quite numerous.  No doubt we should make this change someday.  Do you
> anticipate more than 2 billion documents in your Lucene index sometime
> soon, e.g., this year?
>
> Also, with Java, it's not just a recompile, it's a lot of code changes.

Well ... the refactor should at LEAST be pretty easy... just start
changing int->long and follow up until the code compiles.  Not sure if
it's that easy.

>> Also ... what are the symptoms of a Lucene build using 64bit ints on
>> 32bit processors.  Right now we're personally stuck on 32bit machines
>> but I would like to see us migrate to 64 bit boxes over the next 6
>> months...
>
> Java's int datatype is defined as 32 bit.  So there are no 64-bit
> ints.  There are longs.  I doubt longs are much slower than ints to
> deal with on most JVMs today.  However a long[] is twice as big as an
> int[], and an array may only be indexed by an int.  Currently Lucene
> uses a byte[] indexed by document number to store normalization
> factors.  This would not work if document numbers are longs.  Filters
> index bit vectors with document numbers, and that also would not work
> if document numbers were longs.  Working around these will not only
> take some code, it may also impact performance a bit.
>
> I suspect that Java will soon evolve to better embrace 64-bit
> machines.  Someday assignment of longs will be atomic.  (This is
> hinted at in the language spec.)  Someday arrays will probably be
> indexable by longs. I'd prefer to wait until these changes happen
> before changing Lucene's document numbers to longs.

At some point I might take a look at the code and see how hard it would
be... Thanks for your notes... I'll probably use these in the future.

The main problem is that with indexes that have lots of SMALL documents you
could see yourself running out of ints.

Kevin






Re: int vs long and document ids on 64bit machines.

2004-03-11 Thread Doug Cutting
Kevin A. Burton wrote:
> A discussion I had a while back had someone note (Doug?) that the
> decision to go with 32bit ints for document IDs was that on 32 bit
> machines 64bits weren't threadsafe.

Someone, not me, perhaps provided that rationalization, which isn't a bad
one.  In fact, the situation was more that, in 1997, when I started 
Lucene, 2 billion documents seemed like a lot for a Java-based search 
engine which was designed to scale to perhaps millions of documents, but 
probably not to the world.  Java was slow then, remember?

> Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
> How hard would it be to build a lucene64 that used 64bit document
> handles (longs) for 64bit processors?!  Is it just a recompile?  Will the
> file format break and need updating?!

I think the file format is 64-bit safe.  But the code changes would be 
quite numerous.  No doubt we should make this change someday.  Do you 
anticipate more than 2 billion documents in your Lucene index sometime 
soon, e.g., this year?

Also, with Java, it's not just a recompile, it's a lot of code changes.

> Also ... what are the symptoms of a Lucene build using 64bit ints on
> 32bit processors.  Right now we're personally stuck on 32bit machines
> but I would like to see us migrate to 64 bit boxes over the next 6
> months...

Java's int datatype is defined as 32 bit.  So there are no 64-bit ints. 
 There are longs.  I doubt longs are much slower than ints to deal with 
on most JVMs today.  However a long[] is twice as big as an int[], and 
an array may only be indexed by an int.  Currently Lucene uses a byte[] 
indexed by document number to store normalization factors.  This would 
not work if document numbers are longs.  Filters index bit vectors with 
document numbers, and that also would not work if document numbers were 
longs.  Working around these will not only take some code, it may also 
impact performance a bit.
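
The array-indexing constraint can be sketched like this. Nothing below is Lucene code; NormsSketch, the page size, and the paged layout are assumptions made for illustration. The first lookup is the direct byte[] access that int document numbers allow. The second shows the kind of two-level workaround that long document numbers would force, since Java arrays and java.util.BitSet only accept int indexes.

```java
// Hypothetical sketch: per-document bytes addressed by int ids vs. long ids.
class NormsSketch {
    static final int PAGE_BITS = 10;              // assumed page size: 1024 entries
    static final int PAGE_SIZE = 1 << PAGE_BITS;
    static final long PAGE_MASK = PAGE_SIZE - 1;

    // int ids: a single array access.
    static byte norm(byte[] norms, int docId) {
        return norms[docId];
    }

    // long ids: split the id into a page number and an offset, both ints.
    // Writing norms[docId] with a long docId would not compile without a cast,
    // because array index expressions must be of type int.
    static byte norm(byte[][] pages, long docId) {
        int page = (int) (docId >>> PAGE_BITS);
        int offset = (int) (docId & PAGE_MASK);
        return pages[page][offset];
    }

    public static void main(String[] args) {
        byte[] flat = new byte[PAGE_SIZE];
        flat[42] = 7;

        byte[][] paged = new byte[3][PAGE_SIZE];
        paged[2][5] = 9;
        long bigId = 2L * PAGE_SIZE + 5;

        System.out.println(norm(flat, 42));
        System.out.println(norm(paged, bigId));
    }
}
```

The extra shift, mask, and second array access on every lookup is the sort of cost meant by "impact performance a bit".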

I suspect that Java will soon evolve to better embrace 64-bit machines. 
 Someday assignment of longs will be atomic.  (This is hinted at in the 
language spec.)  Someday arrays will probably be indexable by longs. 
I'd prefer to wait until these changes happen before changing Lucene's 
document numbers to longs.
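
For reference, the atomicity issue looks like this in code. As far as I know, the language spec allows a write to a plain long field to be performed as two 32-bit halves, while a volatile long is always read and written atomically. The sketch below (TornLongDemo and its method names are made up for this example) uses a volatile field, so a reader should only ever see one of the two values the writer stores.

```java
// Hypothetical sketch: a volatile long shared between a writer and a reader.
class TornLongDemo {
    // volatile guarantees atomic reads and writes of the full 64 bits.
    private volatile long value;

    // Returns how many reads observed something other than 0 or -1.
    static int countTornReads(final int iterations) {
        final TornLongDemo shared = new TornLongDemo();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                // Alternate between all bits clear and all bits set.
                shared.value = (i & 1) == 0 ? 0L : -1L;
            }
        });
        writer.start();

        int torn = 0;
        while (writer.isAlive()) {
            long seen = shared.value;
            if (seen != 0L && seen != -1L) {
                torn++; // half of one write mixed with half of another
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("torn reads: " + countTornReads(1000000));
    }
}
```

Without the volatile keyword the spec would permit a torn value here on some JVMs, which is the thread-safety concern raised at the start of this thread.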

Doug



int vs long and document ids on 64bit machines.

2004-03-11 Thread Kevin A. Burton
A discussion I had a while back had someone note (Doug?) that the
decision to go with 32bit ints for document IDs was that on 32 bit
machines 64bits weren't threadsafe.

Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)? 

How hard would it be to build a lucene64 that used 64bit document 
handles (longs) for 64bit processors?!  Is it just a recompile?  Will the
file format break and need updating?!

Also ... what are the symptoms of a Lucene build using 64bit ints on 
32bit processors.  Right now we're personally stuck on 32bit machines 
but I would like to see us migrate to 64 bit boxes over the next 6 
months...

Anyway... thinking out loud.

Kevin



