RE: Ensuring stable timestamp ordering

2010-11-02 Thread Toke Eskildsen
Dennis Gearon [gear...@sbcglobal.net] wrote:
 how about a timestamp with a GUID appended on the end of it?

Since long (8 bytes) is the largest primitive type supported by Java, this would 
have to be represented as a String (or rather a BytesRef) and would take up 4 + 
32 bytes + 2 * 4 bytes from the internal BytesRef attributes + some extra 
overhead. That is quite a large memory penalty to ensure unique timestamps.

Re: Ensuring stable timestamp ordering

2010-11-02 Thread Dennis Gearon
Memory's cheap! (I know processing it is not, though.)

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.





RE: Ensuring stable timestamp ordering

2010-11-01 Thread Dennis Gearon
how about a timestamp with a GUID appended on the end of it?


Dennis Gearon



Re: Ensuring stable timestamp ordering

2010-10-31 Thread Erick Erickson
Oh, I didn't realize that, thanks!

Erick




RE: Ensuring stable timestamp ordering

2010-10-31 Thread Toke Eskildsen
Lance Norskog [goks...@gmail.com] wrote:
 It would be handy to have an auto-incrementing date field, so that
 each document would get a unique number and the timestamp would then
 be the unique ID of the document.

If someone wants to implement this, I'll just note that the granularity of Solr 
dates is fixed to milliseconds:
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html

Using ms for unique timestamps means limiting the index rate to 1000 
documents/second. That might be okay for some applications but a serious 
limiter for others (our Lucene index update rate varies between 300 and 1600 
documents/second, depending on content, and I am sure others have much higher 
rates). One could do tricks, but it is just plain ugly to use something like 
tenths of milliseconds since epoch, so switching to longs and nanoseconds 
seems to be the clean choice if we want the timestamps to be true timestamps 
and not just a unique integer-ID generator.

Re: Ensuring stable timestamp ordering

2010-10-31 Thread Michael Sokolov
Hmm - personally, I wouldn't want to rely on timestamps as a unique-id 
generation scheme.  Might we not one day want to have distributed 
parallel indexing that merges lazily?  Keeping timestamps unique and in 
sync across multiple nodes would be a tough requirement. I would be 
happy simply having NOW be more fine-grained, and this does seem like 
something that would be nice to have at a fairly low level, but as I 
said, if it would introduce backward-compatibility problems, it's easy 
enough to create a timestamp field in the indexing feed.
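
A minimal sketch of creating the timestamp in the indexing feed, as suggested 
above. The class and method names are hypothetical, and it assumes the date 
field accepts UTC values of the form yyyy-MM-ddTHH:mm:ss.SSSZ. Each document 
in a batch gets its own millisecond, so the values are distinct within that 
batch but may run slightly ahead of the wall clock:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a feed that supplies its own timestamp values. */
final class FeedTimestamps {
    // Assumed date syntax: UTC with millisecond precision and a literal 'Z'.
    private static final DateTimeFormatter SOLR_DATE =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                         .withZone(ZoneOffset.UTC);

    private FeedTimestamps() {}

    /** Formats milliseconds since the epoch as a UTC date string. */
    static String format(long epochMillis) {
        return SOLR_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    /** Returns one distinct value per document: start, start + 1 ms, ... */
    static List<String> forBatch(long startMillis, int documents) {
        List<String> stamps = new ArrayList<>(documents);
        for (int i = 0; i < documents; i++) {
            stamps.add(format(startMillis + i));
        }
        return stamps;
    }
}
```

A feed could call FeedTimestamps.forBatch(System.currentTimeMillis(), n) and 
put the i-th value into the timestamp field of the i-th document instead of 
relying on default=NOW. Batches that overlap in time would still need a 
shared counter to avoid reusing a value.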


Thank you for clarifying this.

-Mike






RE: Ensuring stable timestamp ordering

2010-10-31 Thread Dennis Gearon
Even microseconds may not be enough on some really good, fast machine.
Dennis Gearon



RE: Ensuring stable timestamp ordering

2010-10-31 Thread Toke Eskildsen
Dennis Gearon [gear...@sbcglobal.net] wrote:
 Even microseconds may not be enough on some really good, fast machine.

True, especially since the timer might not provide microsecond granularity 
although the returned value is in microseconds. However, a unique timestamp 
generator should keep track of the previous timestamp to guard against 
duplicates. Uniqueness can thus be guaranteed by waiting a bit or cheating on 
the decimals. With microseconds, one can produce 1 million timestamps / second. 
While I agree that duplicates within microseconds can occur on a fast machine, 
guaranteeing uniqueness by waiting should only be a performance problem when 
the number of duplicates is high. That's still a few years off, I think.
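
The "keep track of the previous timestamp" idea can be sketched roughly as 
follows. This is only an illustration with a hypothetical class name, not 
anything that exists in Solr. It takes the "cheating on the decimals" route: 
the clock is read in milliseconds, scaled to microseconds, and bumped by one 
whenever it has not moved past the previously issued value.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical unique timestamp generator. Values are microseconds since
 * the epoch and are unique within a single JVM only.
 */
final class UniqueTimestamps {
    // Last value handed out, kept so that duplicates can be avoided.
    private static final AtomicLong LAST = new AtomicLong(Long.MIN_VALUE);

    private UniqueTimestamps() {}

    /** Returns a strictly increasing microsecond timestamp. */
    static long nextMicros() {
        // The clock only has millisecond granularity, so the lower three
        // decimals are zero until they are needed to break a tie.
        long now = System.currentTimeMillis() * 1000L;
        // If the clock has not advanced past the previous value, hand out
        // previous + 1 instead of waiting for the next tick.
        return LAST.updateAndGet(prev -> Math.max(now, prev + 1));
    }
}
```

Under a sustained rate above 1 million calls / second the values would drift 
ahead of the real clock, which is the point where waiting instead of cheating 
becomes the honest option.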

As Michael pointed out, using normal timestamps as unique IDs might not be such 
a great idea, as it effectively locks index-building to a single JVM. This could 
be fixed by going the ugly route: expressing the time in nanos with only 
microsecond granularity and using the last 3 decimals for a builder ID. 
Not very clean though, as the contract is not expressed in the data themselves 
but must nevertheless be obeyed by all builders to avoid collisions. It also 
raises the question of who should assign the builder IDs. Not trivial in an 
anarchistic setup where new builders can be added by different controllers.
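
To make the builder ID layout concrete, here is a small sketch with 
hypothetical names. It assumes non-negative microsecond timestamps and builder 
IDs in the range 0-999, stored in the last three decimals of a nanosecond 
value:

```java
/**
 * Hypothetical encoding: nanoseconds since the epoch with microsecond
 * granularity, where the last three decimals carry a builder ID.
 */
final class BuilderStampedTime {
    private BuilderStampedTime() {}

    /** Combines a microsecond timestamp with a builder ID in 0..999. */
    static long encode(long micros, int builderId) {
        if (builderId < 0 || builderId > 999) {
            throw new IllegalArgumentException("builder ID must be in 0..999");
        }
        return micros * 1000L + builderId;
    }

    /** Extracts the microsecond timestamp again. */
    static long micros(long stamped) {
        return stamped / 1000L;
    }

    /** Extracts the builder ID from the last three decimals. */
    static int builderId(long stamped) {
        return (int) (stamped % 1000L);
    }
}
```

Sorting on the encoded value still orders by time first, with the builder ID 
acting only as a tiebreaker. Nothing in the value itself says that the last 
three decimals are an ID, which is the unexpressed contract mentioned above.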

Pragmatists might use the PID % 1000 or similar for the builder ID as it does 
not require coordination, but this is where the Birthday Paradox hits us again: 
The chance of two processes on different machines ending up with the same 
builder ID is about 10% if just 15 machines are used (1% for 5 machines, 50% 
for 37 machines). I don't like those odds, and that's assuming that the PIDs 
will be randomly distributed, which they won't. It could be lowered by reserving 
more decimals for the salt, but then we would decrease the maximum number of 
timestamps / second, still without guaranteed uniqueness. Guys a lot smarter 
than me have spent time on the unique ID problem and it's clearly not easy: 
Java's UUID takes up 128 bits.
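
The odds quoted above can be checked with the standard birthday calculation. 
The sketch below uses a hypothetical class name and assumes, as the text does, 
that every builder picks one of 1000 IDs uniformly at random:

```java
/** Hypothetical helper for the birthday-paradox estimate. */
final class BirthdayOdds {
    private BirthdayOdds() {}

    /**
     * Probability that at least two of the given builders end up with the
     * same ID when each picks one of `slots` IDs uniformly at random.
     */
    static double collisionProbability(int builders, int slots) {
        double allDistinct = 1.0;
        for (int i = 0; i < builders; i++) {
            // Builder i must avoid the i IDs that are already taken.
            allDistinct *= (double) (slots - i) / slots;
        }
        return 1.0 - allDistinct;
    }
}
```

With slots = 1000, calling collisionProbability for 5, 15 and 37 builders 
gives values close to 1%, 10% and 50% respectively. Real PIDs are not 
uniformly distributed, so the true odds are likely worse.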

- Toke

Re: Ensuring stable timestamp ordering

2010-10-30 Thread Erick Erickson
What are the actual values in your index? I'm wondering if they
all get the same values somehow, perhaps due to the granularity
of your dates? And (and I'm really grasping at straws here) your
commit is causing enough delay to have time intervals be greater
than your granularity.

Unfortunately, that doesn't make much sense either. If you sort on a
field, the tiebreaker should be the document ID order absent secondary
sorts...

So, can you post the results of adding debugQuery=on to your URL?
Also, use the schema browser from the admin page to see what you
actually have in your index.

Not much help, but the best I can do this evening.

Erick





Re: Ensuring stable timestamp ordering

2010-10-30 Thread Lance Norskog
Hi-

NOW does not get re-run for each document. If you give a large upload
batch, the same NOW is given to each document.

It would be handy to have an auto-incrementing date field, so that
each document would get a unique number and the timestamp would then
be the unique ID of the document.







-- 
Lance Norskog
goks...@gmail.com


RE: Ensuring stable timestamp ordering

2010-10-28 Thread Michael Sokolov
(Sorry - fumble finger sent too soon.)


My confusion stems from the fact that in my test I insert a number of
documents, and then retrieve them ordered by timestamp, and they don't come
back in the same order they were inserted (the order seems random), unless I
commit after each insert. 

Is that expected?  I could create my own timestamp values easily enough, but
would just as soon not do so if I could use a pre-existing feature that
seems tailor-made.

-Mike

 -Original Message-
 From: Michael Sokolov [mailto:soko...@ifactory.com] 
 Sent: Thursday, October 28, 2010 9:55 PM
 To: 'solr-user@lucene.apache.org'
 Subject: Ensuring stable timestamp ordering
 
 I'm curious what if any guarantees there are regarding the 
 timestamp field that's defined in the sample solr 
 schema.xml.  Just for completeness, the definition is:
 

   <!-- Uncommenting the following will create a timestamp field using
        a default value of NOW to indicate when each document was indexed.
     -->
   <field name="timestamp" type="date" indexed="true" stored="true"
          default="NOW" multiValued="false"/>