[Solr Wiki] Update of "UniqueKey" by Lance Norskog

Apache Wiki Mon, 02 Feb 2009 21:28:10 -0800

The following page has been changed by Lance Norskog:
http://wiki.apache.org/solr/UniqueKey

------------------------------------------------------------------------------
  The Solr '''uniqueKey''' field encodes the ''identity semantics'' of a 
document. In database jargon, it is the ''primary key''.
  
  Different possibilities for unique key:
-  1. Raw text native to the document
+  * Raw text native to the document
-  1. UUID data generated from data in the document
+  * UUID data generated from data in the document
-  1. UUID key generated from the time of insertion
+  * UUID key generated from the time of insertion
  
  You do not necessarily need a unique key to make an index, but almost all 
indexes use one.
  
  In these use cases, no unique key is necessary.
-  1. Build an index from empty. Search for documents. When you have new 
documents to add, either add them or clear index and reindex from scratch. You 
know that you will never add the same document twice.
+  * Build an index from empty. Search for documents. When you have new 
documents to add, either add them or clear index and reindex from scratch. You 
know that you will never add the same document twice.
-  1. Sort documents found in a search against a numerical field.
+  * Sort documents found in a search against a numerical field.
  
  These use cases require a unique key.
-  1. Add documents incrementally. You do not rebuild the index from scratch 
but want to add new documents periodically. You may add the same document twice 
and it will only be stored once.
+  * Add documents incrementally. You do not rebuild the index from scratch but 
want to add new documents periodically. You may add the same document twice and 
it will only be stored once.
    An example is an RSS feed from a blog: the RSS feed will be polled, 
therefore the same article may appear more than once.
-  1. Do statistical analysis on indexed data.
+  * Do statistical analysis on indexed data.
    If the index is large, you might want to pick a subset for your research. A 
wildcard search on the key field does this.
-  1. Share document identity with other database systems in "vertical 
partition" style.
+  * Share document identity with other database systems in "vertical 
partition" style.
    As an example, store only index data but not original fields for large 
documents. To fetch documents found in Solr, you will need to store the same 
unique key in both the index and the database.
-  1. Change definition of document identity.
+  * Change definition of document identity.
    Use cases change, and you may want to change the identity of the documents. 
For example, an RSS feed for videos might change to give different entries for 
the same video in different sizes. You may decide that the different entries 
are really the same document.
    There is a saying in database design: ''data sticks where it lands''. Once 
you store data in some format and container, it is very hard to change this 
decision. By adding a layer of indirection in the Solr schema's identity, you 
give yourself the ability to change the innate identity of the document.
-  1. Multiple queries about the same document, with document id saved for 
future reference.
+  * Multiple queries about the same document, with document id saved for 
future reference.
-  1. Delete documentss.
+  * Delete documents.
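The incremental-add case above can be sketched with a toy in-memory index. This is only an illustration of "add the same document twice and it will only be stored once"; it is not Solr code, and the names `index`, `add_document`, `delete_document` and the example URL are hypothetical.

```python
# Toy stand-in for an index keyed on a unique key field (not Solr itself).
index = {}

def add_document(doc, key_field="id"):
    """Store doc under its unique key, replacing any earlier copy."""
    index[doc[key_field]] = doc

def delete_document(key):
    """Remove the document with this key, if it is present."""
    index.pop(key, None)

# The same feed entry arrives on two successive polls of an RSS feed.
entry = {"id": "http://example.com/blog/post-1", "title": "First post"}
add_document(entry)
add_document(dict(entry))  # second poll: same key, so still one stored copy

# A later poll delivers an edited version under the same key.
add_document({"id": "http://example.com/blog/post-1",
              "title": "First post (edited)"})
```

Without a key, the index would have no way to tell that the second and third additions refer to the document it already holds, and deletion would have nothing to address.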
  
- For these use cases you need to generate the key from fields in the document.
+ For these use cases you need to generate the key from fields in the document. 
The key should be a short unique string.
-  1. Allow different database systems to create identity keys that work in 
other systems.
+  * Allow different database systems to create identity keys that work in 
other systems.
    The documents may come from multiple sources, and be stored in multiple 
places. There may not be one convenient place in the indexing path to create a 
unique id. The different sources will need to separately implement the same 
algorithm.
  
+ =UUID=
+       UUID is short for Universally Unique IDentifier, a data value that is 
different for every document. The UUID standard 
[http://www.ietf.org/rfc/rfc4122.txt RFC-4122] defines several versions of 
UUID that are generated from different inputs. There is a UUID field type in 
Solr 1.4 which implements version 4. Also, the ExtractingRequestHandler 
automatically creates a version 4 UUID.
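As a rough illustration of how UUID versions differ in their inputs, the sketch below uses Python's standard `uuid` module rather than Solr's UUID field type. The namespace and the example URL are arbitrary choices made for the example.

```python
import uuid

# Version 4 is built from random bits: every call gives a different value.
random_key = uuid.uuid4()

# Version 5 is name-based (SHA-1 of a namespace plus a name): the same input
# always gives the same value, so separate systems can compute it independently.
name = "http://example.com/blog/post-1"  # hypothetical document identifier
name_key_a = uuid.uuid5(uuid.NAMESPACE_URL, name)
name_key_b = uuid.uuid5(uuid.NAMESPACE_URL, name)

print(random_key.version, name_key_a.version)
print(name_key_a == name_key_b)
print(name_key_a.hex)  # the 32-character hexadecimal form, without dashes
```

A random UUID suits keys assigned once at insertion time. A name-based UUID suits keys that must be reproducible from data in the document.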
- Should use a cryptographic hashcode:
-       1) Statistical analysis use case, and limiting large queries to smaller 
queries.
-       2) Exported primary key for large sets. Database searches on this can 
be much faster than on the source field.
  
+ ==Cryptographic hash==
+       A cryptographic hashing algorithm can be thought of as creating N very 
random bits from the input data. The MD5 algorithm creates 128 bits. This means 
that two different inputs have a chance of about 1 in 2^128 of creating the 
same MD5. (MD5 has been "cracked" and there are techniques for creating 
collisions deliberately, but this is unlikely to happen in your Solr 
production.) There is a standard expression of this as 32 hexadecimal 
characters ([http://www.ietf.org/rfc/rfc1321.txt RFC-1321]). Several MD5 digest 
packages for various languages do not follow this standard. Time-based UUIDs 
(version 1 in RFC-4122) include the time at the creation of the UUID, which 
precludes some of the above use cases. You can cheat and ignore the clock 
requirement. It is probably better to use the UUID type than a raw string.
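A minimal sketch of generating a key from fields in the document, using Python's standard `hashlib`. The helper name `make_key`, the choice of fields, and the separator are assumptions for illustration. Every source that creates keys would need to apply exactly the same rules.

```python
import hashlib

def make_key(*fields):
    """Return the MD5 digest of the field values as 32 hexadecimal characters."""
    # Join with a separator unlikely to occur in the data, so that
    # ("ab", "c") and ("a", "bc") do not hash to the same key.
    joined = "\x1f".join(fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical fields: a source name plus the document's native identifier.
key = make_key("example-feed", "http://example.com/blog/post-1")
print(key, len(key))
```

`hexdigest()` returns the lowercase 32-character hexadecimal form. Libraries that return raw bytes or Base64 need an explicit conversion before their keys will match.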
- UUID:
-       UUID is short for Universal Unique IDentifier, which means a data value 
that is different for every document.
-       The UUID standard http://www.ietf.org/rfc/rfc4122.txt includes several 
types of UUID with different input formats.
-       There is a UUID field type in Solr 1.3 which implements one of these 
formats.
-       Also, the ExtractingRequestHandler implements its own UUID format.
-       The standard always includes the time at the creation of the UUID, 
which precludes some of the above use cases.
-       You can cheat and ignore the clock requirement. It is probably better 
to use the UUID type than a raw string.
  
+  One advantage in using a crypto-generated unique key is that you can select 
a random subset of documents via wildcards. If the key is saved as a string of 
32 hexadecimal characters, such as 'd3adbe3fdeadb3e4deadbee4deadb3ef', the 
query "id:a*" will select a random 1/16 of the entire document set, and 
"id:aa*" selects a random 1/256 of it. Statistical analysis and data 
extraction projects can use this to select small subsets instead of walking 
the entire index.
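The sampling effect can be checked without a Solr instance. The sketch below hashes a set of made-up document names and filters the keys by prefix on the client side, which mimics what the wildcard queries "id:a*" and "id:aa*" would match. The document names and the helper `select_prefix` are hypothetical.

```python
import hashlib

# Made-up corpus: 16000 documents, each keyed by the MD5 of its name.
keys = [hashlib.md5(f"doc-{i}".encode("utf-8")).hexdigest()
        for i in range(16000)]

def select_prefix(all_keys, prefix):
    """Return the keys a prefix wildcard such as id:a* would match."""
    return [k for k in all_keys if k.startswith(prefix)]

subset_1 = select_prefix(keys, "a")    # expect roughly 1/16 of the keys
subset_2 = select_prefix(keys, "aa")   # expect roughly 1/256 of the keys

print(len(subset_1) / len(keys))
print(len(subset_2) / len(keys))
```

The fractions are only approximately 1/16 and 1/256, since the hash spreads keys evenly on average rather than exactly.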
- Cryptographic hash:
-       A cryptographic hashing algorithm can be thought of as creating N very 
random bits from the input data.
-       The MD5 algorithm create 128 bits. This means that 2 input data sets 
have a chance of 1 in 2^128 of creating the same MD5. (MD5 has been "cracked" 
and there are techniques for creating collisions, but this is unlikely to 
happen in your Solr production.)
-       There is a standard expression of this as 32 hexadecimal characters. 
http://www.ietf.org/rfc/rfc1321.txt 
-       Several MD5 digest algorithm packages for various languages do not 
follow this standard.
  
-       One advantage in using a crypto-generated unique key is that you can 
select a random subset of documents via wildcards. If the UUID data is saved as 
a string in the 32-character RFC format, 'd3adbe3fdeadb3e4deadbee4deadb3ef', 
the query "id:a*" will select a random 1/16 of the entire document set. 
"id:aa*" selects 1/256 of the document set, again very randomly. Statistical 
analysis and data extraction projects can use this to select small subsets 
instead of walking the entire index.
- 
- 
- 
-
