RE: Ensuring stable timestamp ordering
Dennis Gearon [gear...@sbcglobal.net] wrote:
how about a timestamp with either a GUID appended on the end of it?

Since long (8 bytes) is the largest atomic type supported by Java, this would have to be represented as a String (or rather BytesRef) and would take up 4 + 32 bytes + 2 * 4 bytes from the internal BytesRef-attributes + some extra overhead. That is quite a large memory penalty to ensure unique timestamps.
Re: Ensuring stable timestamp ordering
memory's cheap! (I know processing it is not, though.)

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life, otherwise we all die.

----- Original Message -----
From: Toke Eskildsen t...@statsbiblioteket.dk
To: solr-user@lucene.apache.org
Sent: Mon, November 1, 2010 11:45:34 PM
Subject: RE: Ensuring stable timestamp ordering

That is quite a large memory penalty to ensure unique timestamps.
RE: Ensuring stable timestamp ordering
how about a timestamp with either a GUID appended on the end of it?

Dennis Gearon

--- On Sun, 10/31/10, Toke Eskildsen t...@statsbiblioteket.dk wrote:

From: Toke Eskildsen t...@statsbiblioteket.dk
Subject: RE: Ensuring stable timestamp ordering
To: solr-user@lucene.apache.org
Date: Sunday, October 31, 2010, 12:18 PM

Guys a lot smarter than me have spent time on the unique ID problem and it's clearly not easy: Java's UUID takes up 128 bits.
Re: Ensuring stable timestamp ordering
Oh, I didn't realize that, thanks!

Erick

On Sat, Oct 30, 2010 at 10:27 PM, Lance Norskog goks...@gmail.com wrote:

Hi- NOW does not get re-run for each document. If you give a large upload batch, the same NOW is given to each document.
RE: Ensuring stable timestamp ordering
Lance Norskog [goks...@gmail.com] wrote:
It would be handy to have an auto-incrementing date field, so that each document would get a unique number and the timestamp would then be the unique ID of the document.

If someone wants to implement this, I'll just note that the granularity of Solr dates is fixed to milliseconds: http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html

Using ms for unique timestamps means limiting the index rate to 1000 documents/second. That might be okay for some applications but a serious limiter for others (our Lucene index update rate varies between 300 and 1600 documents/second, depending on content; I am sure others have much higher rates). One could do tricks, but it is just plain ugly to use something like tenths of milliseconds since epoch, so switching to longs and nanoseconds seems to be the clean choice if we want the timestamps to be true timestamps and not just a unique integer-ID generator.
Re: Ensuring stable timestamp ordering
Hmm - personally, I wouldn't want to rely on timestamps as a unique-id generation scheme. Might we not one day want to have distributed parallel indexing that merges lazily? Keeping timestamps unique and in sync across multiple nodes would be a tough requirement. I would be happy simply having NOW be more fine-grained, and this does seem like something that would be nice to have at a fairly low level, but as I said, if it would introduce backward-compatibility problems, it's easy enough to create a timestamp field in the indexing feed. Thank you for clarifying this.

-Mike

On 10/31/2010 11:33 AM, Toke Eskildsen wrote:

Lance Norskog [goks...@gmail.com] wrote:
It would be handy to have an auto-incrementing date field, so that each document would get a unique number and the timestamp would then be the unique ID of the document.

If someone wants to implement this, I'll just note that the granularity of Solr dates is fixed to milliseconds: http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html
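Creating the timestamp in the indexing feed could look roughly like the following. This is a minimal sketch, not part of Solr: the class name `FeedTimestamper` and its methods are made up for illustration, it assumes a single feeding process, and it assumes the date field accepts ISO-8601 UTC strings with millisecond precision. Because it never hands out the same millisecond twice, a feed running faster than 1000 documents/second will drift ahead of the wall clock.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: stamps each document in the feed instead of relying on default="NOW". */
final class FeedTimestamper {
    // Fixed-width ISO-8601 in UTC, so string order matches chronological order.
    private static final DateTimeFormatter ISO_MILLIS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").withZone(ZoneOffset.UTC);

    // Last millisecond value handed out; guarded by the synchronized method below.
    private long lastMillis = Long.MIN_VALUE;

    /** Returns a timestamp string that is strictly later than the previous one from this instance. */
    synchronized String nextTimestamp() {
        // If the clock has not advanced since the last call, step one millisecond forward.
        long millis = Math.max(System.currentTimeMillis(), lastMillis + 1);
        lastMillis = millis;
        return format(millis);
    }

    /** Formats milliseconds since epoch as an ISO-8601 UTC string. */
    static String format(long epochMillis) {
        return ISO_MILLIS.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        FeedTimestamper stamper = new FeedTimestamper();
        for (int i = 0; i < 3; i++) {
            System.out.println(stamper.nextTimestamp());
        }
    }
}
```

Each document in the feed would get the result of `nextTimestamp()` as the value of its timestamp field, so insertion order survives a batch upload. It does nothing for the multi-node case raised above.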
RE: Ensuring stable timestamp ordering
Even microseconds may not be enough on some really good, fast machine.

Dennis Gearon

--- On Sun, 10/31/10, Toke Eskildsen t...@statsbiblioteket.dk wrote:

From: Toke Eskildsen t...@statsbiblioteket.dk
Subject: RE: Ensuring stable timestamp ordering
To: solr-user@lucene.apache.org
Date: Sunday, October 31, 2010, 8:33 AM

One could do tricks, but it is just plain ugly to use something like tenths of milliseconds since epoch, so switching to longs and nanoseconds seems to be the clean choice if we want the timestamps to be true timestamps and not just a unique integer-ID generator.
RE: Ensuring stable timestamp ordering
Dennis Gearon [gear...@sbcglobal.net] wrote:
Even microseconds may not be enough on some really good, fast machine.

True, especially since the timer might not provide microsecond granularity although the returned value is in microseconds. However, a unique timestamp generator should keep track of the previous timestamp to guard against duplicates. Uniqueness can thus be guaranteed by waiting a bit or cheating on the decimals. With microseconds we can produce 1 million timestamps / second. While I agree that duplicates within microseconds can occur on a fast machine, guaranteeing uniqueness by waiting should only be a performance problem when the number of duplicates is high. That's still a few years off, I think.

As Michael pointed out, using normal timestamps as unique IDs might not be such a great idea as it effectively locks index-building to a single JVM. By going the ugly route and expressing the time in nanos with only microsecond granularity and using the last 3 decimals for a builder ID this could be fixed. Not very clean though, as the contract is not expressed in the data themselves but must nevertheless be obeyed by all builders to avoid collisions. It also raises the question of who should assign the builder IDs. Not trivial in an anarchistic setup where new builders can be added by different controllers.

Pragmatists might use the PID % 1000 or similar for the builder ID as it does not require coordination, but this is where the Birthday Paradox hits us again: The chance of two processes on different machines having the same PID is 10% if just 15 machines are used (1% for 5 machines, 50% for 37 machines). I don't like those odds and that's assuming that the PIDs will be randomly distributed, which they won't. It could be lowered by reserving more decimals for the salt, but then we would decrease the maximum amount of timestamps / second, still without guaranteed uniqueness.
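The collision odds quoted for PID % 1000 can be checked with the standard birthday-problem product. A minimal sketch, assuming each builder's ID is drawn uniformly at random from 1000 values (which, as noted, real PIDs are not); the class and method names are hypothetical.

```java
/** Illustrative only: collision odds when builders pick IDs uniformly from a fixed number of slots. */
final class BuilderIdCollision {

    /** Probability that at least two of `builders` share an ID out of `slots` equally likely IDs. */
    static double collisionProbability(int builders, int slots) {
        double allDistinct = 1.0;
        for (int i = 0; i < builders; i++) {
            // The i-th builder has to avoid the i IDs that are already taken.
            allDistinct *= (double) (slots - i) / slots;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        for (int builders : new int[] {5, 15, 37}) {
            System.out.printf("%d builders: %.1f%%%n",
                    builders, 100 * collisionProbability(builders, 1000));
        }
    }
}
```

With 1000 slots this comes out at roughly 1% for 5 builders, 10% for 15 and just under 50% for 37, in line with the figures above.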
Guys a lot smarter than me have spent time on the unique ID problem and it's clearly not easy: Java's UUID takes up 128 bits.

- Toke
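The "ugly route" from this message (nanoseconds since epoch with microsecond granularity, the last 3 decimals carrying a builder ID, and a remembered previous value to guard against duplicates) might be sketched as below. This is an illustration under stated assumptions, not an existing Solr or Lucene class: `UniqueTimestampGenerator` is a made-up name, builder IDs are assumed to be assigned without collisions by some outside party, and "cheating on the decimals" is implemented by stepping one microsecond past the previous value instead of waiting.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch: unique timestamps as nanoseconds since epoch, with
 * microsecond granularity and the last three decimal digits reserved for a
 * builder ID in the range 0..999.
 */
final class UniqueTimestampGenerator {
    private final int builderId;
    // Last microsecond value handed out by this generator.
    private final AtomicLong lastMicros = new AtomicLong(Long.MIN_VALUE);

    UniqueTimestampGenerator(int builderId) {
        if (builderId < 0 || builderId > 999) {
            throw new IllegalArgumentException("builderId must be in 0..999");
        }
        this.builderId = builderId;
    }

    /** Returns a value strictly greater than any previous value from this instance. */
    long next() {
        Instant now = Instant.now();
        long nowMicros = now.getEpochSecond() * 1_000_000L + now.getNano() / 1_000;
        // If the clock has not advanced past the previous value, use previous + 1 microsecond.
        long micros = lastMicros.updateAndGet(prev -> Math.max(prev + 1, nowMicros));
        // Scale to nanoseconds and put the builder ID in the last three decimals.
        return micros * 1_000L + builderId;
    }

    public static void main(String[] args) {
        UniqueTimestampGenerator generator = new UniqueTimestampGenerator(42);
        for (int i = 0; i < 3; i++) {
            System.out.println(generator.next());
        }
    }
}
```

Two generators with different builder IDs can never return the same value, since the results differ in their last three digits. The contract still lives outside the data, and a signed long holds nanoseconds since epoch only until the year 2262.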
Re: Ensuring stable timestamp ordering
What are the actual values in your index? I'm wondering if they all get the same values somehow, perhaps due to the granularity of your dates? And (and I'm really grasping at straws here) your commit is causing enough delay to have time intervals be greater than your granularity. Unfortunately, that doesn't make much sense either. If you sort on a field, the tiebreaker should be the document ID order absent secondary sorts...

So, can you post the results of adding debugQuery=on to your URL? Also, use the schema browser from the admin page to see what you actually have in your index.

Not much help, but the best I can do this evening.

Erick

On Thu, Oct 28, 2010 at 9:58 PM, Michael Sokolov soko...@ifactory.com wrote:

My confusion stems from the fact that in my test I insert a number of documents, and then retrieve them ordered by timestamp, and they don't come back in the same order they were inserted (the order seems random), unless I commit after each insert. Is that expected?
Re: Ensuring stable timestamp ordering
Hi-

NOW does not get re-run for each document. If you give a large upload batch, the same NOW is given to each document.

It would be handy to have an auto-incrementing date field, so that each document would get a unique number and the timestamp would then be the unique ID of the document.

On Sat, Oct 30, 2010 at 7:19 PM, Erick Erickson erickerick...@gmail.com wrote:

What are the actual values in your index? I'm wondering if they all get the same values somehow, perhaps due to the granularity of your dates?

--
Lance Norskog
goks...@gmail.com
RE: Ensuring stable timestamp ordering
(Sorry - fumble finger sent too soon.)

My confusion stems from the fact that in my test I insert a number of documents, and then retrieve them ordered by timestamp, and they don't come back in the same order they were inserted (the order seems random), unless I commit after each insert. Is that expected? I could create my own timestamp values easily enough, but would just as soon not do so if I could use a pre-existing feature that seems tailor-made.

-Mike

-----Original Message-----
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Thursday, October 28, 2010 9:55 PM
To: 'solr-user@lucene.apache.org'
Subject: Ensuring stable timestamp ordering

I'm curious what if any guarantees there are regarding the timestamp field that's defined in the sample solr schema.xml. Just for completeness, the definition is:

<!-- Uncommenting the following will create a timestamp field using a default value of NOW to indicate when each document was indexed. -->
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>