On 3/15/2011 12:54 PM, onlinespend...@gmail.com wrote:
That's pretty interesting to use the autoincrementing document ID as a way to keep track of what has not been indexed in Solr. And you overwrite this document ID even when you modify an existing document. Very cool. I suppose the number can even rotate back to 0, as long as you handle that.
We use a bigint for the value, and the highest value is currently less than 300 million, so we don't expect it to ever rotate around to 0. My build system would not be able to handle wraparound without manual intervention. If we have that problem, I think we'd have to renumber the entire database and reindex.
I am thinking of using a timestamp to achieve a similar thing. All documents that have been accessed after the last Solr index need to be added to the Solr index. In fact, each name-value pair in Cassandra has a timestamp associated with it, so I'm curious if I could simply use this.
As long as you can guarantee that it's all deterministic and idempotent, you can use anything you like. I hope you know what those words mean. :) It's important when using timestamps that the system that runs the build script is the same one that stores the last-used timestamp. That way you are guaranteed that you will never have things getting missed because of clock skew.
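To make "deterministic and idempotent" concrete, here is a minimal Python sketch of a timestamp-windowed update. It is only an illustration, not anything from my build system: the names `select_changed` and `apply_batch`, the `modified` field, and the in-memory dict standing in for the Solr index are all assumptions. It also assumes every timestamp being compared comes from one clock, so it does nothing to correct skew between machines.

```python
def select_changed(records, last_ts, run_ts):
    """Return records modified after last_ts and no later than run_ts.

    Hypothetical helper: run_ts is read once at the start of a run and
    becomes last_ts for the next run, so windows never overlap or leave gaps.
    """
    return [r for r in records if last_ts < r["modified"] <= run_ts]


def apply_batch(index, batch):
    """Upsert each record by its ID (a dict stands in for the index here).

    Because a record replaces any earlier copy with the same ID, replaying
    the same batch leaves the index unchanged, which is what makes a retry safe.
    """
    for record in batch:
        index[record["id"]] = record
    return index
```

If a run fails partway, you keep the old `last_ts` and run the same window again. Since the update is keyed on the document ID, the second pass produces the same index as a single clean pass would have.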
I'm curious how you handle the delta-imports. Do you have some routine that periodically checks for updates to your MySQL database via the document ID? Which language do you use for that?
The entire build system is written in Perl, where I am comfortable. I even wrote an object-oriented module that the scripts share. The update script runs every two minutes, from cron, indexing anything with a higher document ID than the one recorded during the last successful run. There are some other scripts that run on longer intervals and handle things like deletes and data redistribution into shards. These scripts kick off the build, then use the bare /dataimport URL to track when the import completes and whether it's successful.
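For anyone who wants the general shape of the two-minute update run, here is a rough sketch. It is in Python rather than the Perl I actually use, and every name in it (`run_update`, `wait_until_idle`, the JSON state file, the callables passed in) is hypothetical. The callables stand in for the database query, the request that starts the import, and the check of the /dataimport status; the sketch assumes the status reads "busy" while an import is running and leaves the success check to the caller.

```python
import json
import os
import tempfile
import time


def load_last_id(path):
    """Return the last successfully indexed document ID, or 0 if none is recorded."""
    try:
        with open(path) as fh:
            return int(json.load(fh)["last_id"])
    except FileNotFoundError:
        return 0


def save_last_id(path, last_id):
    """Write the new ID to a temp file, then rename it over the old state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"last_id": last_id}, fh)
    os.replace(tmp, path)


def wait_until_idle(get_status, poll_seconds=5.0, max_polls=720):
    """Poll get_status() until it stops returning "busy"; return the final value."""
    for _ in range(max_polls):
        status = get_status()
        if status != "busy":  # assumption: "busy" means the import is still running
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("import still busy after the polling limit")


def run_update(state_path, current_max_id, start_import, get_status,
               import_ok, poll_seconds=5.0):
    """Index everything above the recorded ID; advance the record only on success."""
    last_id = load_last_id(state_path)
    new_max = current_max_id()
    if new_max <= last_id:
        return last_id  # nothing new since the last successful run
    start_import(last_id, new_max)
    wait_until_idle(get_status, poll_seconds)
    if not import_ok():
        return last_id  # leave the record alone so the next run retries this range
    save_last_id(state_path, new_max)
    return new_max
```

The important part is the ordering: the recorded ID only moves forward after the import is confirmed successful, so a failed run is simply retried two minutes later over the same range.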
Thanks, Shawn