Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The following page has been changed by Lance Norskog:
http://wiki.apache.org/solr/UniqueKey

------------------------------------------------------------------------------
  == UUID techniques ==
   UUID is short for Universally Unique IDentifier. The UUID standard 
[http://www.ietf.org/rfc/rfc4122.txt RFC-4122] defines several versions of UUID 
with different inputs. Solr 1.4 has a UUID field type which implements 
version 4. The ExtractingRequestHandler also automatically creates version 4 
UUIDs. You can also build a UUID string from a cryptographic hash.
  == Cryptographic hash ==
-  A cryptographic hashing algorithm can be thought of as creating N very 
random bits from the input data. The MD5 algorithm creates 128 bits. This means 
that two different inputs have a chance of about 1 in 2^128 of producing the 
same MD5 (leaving aside deliberately constructed collisions). The standard 
text form is 32 hexadecimal characters 
([http://www.ietf.org/rfc/rfc1321.txt RFC-1321]). Several MD5 digest packages 
for various languages do not follow this standard. Time-based (version 1) UUIDs 
include the time at the creation of the UUID, which precludes some of the above 
use cases; you can cheat and ignore the clock requirement, or use the 
name-based versions 3 and 5 in RFC-4122, which are computed from a hash and do 
not involve the clock. It is best to use the UUID text format: 
''550e8400-e29b-41d4-a716-446655440000'' instead of 
''550e8400e29b41d4a716446655440000''. (You will read many of these 
keys.)[[BR]]
+  A cryptographic hashing algorithm can be thought of as creating N very 
random bits from the input data. The MD5 algorithm creates 128 bits. This means 
that two different inputs have a chance of about 1 in 2^128 of producing the 
same MD5 (leaving aside deliberately constructed collisions). The standard 
text form is 32 hexadecimal characters 
([http://www.ietf.org/rfc/rfc1321.txt RFC-1321]). Several MD5 digest packages 
for various languages do not follow this standard. Time-based (version 1) UUIDs 
include the time at the creation of the UUID, which precludes some of the above 
use cases; you can cheat and ignore the clock requirement, or use the 
name-based versions 3 and 5 in RFC-4122, which are computed from a hash and do 
not involve the clock. It is best to use the UUID text format: 
''550e8400-e29b-41d4-a716-446655440000'' instead of 
''550e8400e29b41d4a716446655440000''. (You will read many of these keys.)
+ 
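As a rough illustration of building a UUID-style key from a cryptographic hash, here is a minimal Python sketch. It is not part of Solr. The function names `md5_uuid_string` and `md5_name_uuid`, and the choice of the URL namespace, are assumptions made for this example only.

```python
import hashlib
import uuid


def md5_uuid_string(data: bytes) -> str:
    """Return the MD5 digest of `data` in the 8-4-4-4-12 UUID text layout."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes = 128 bits
    # uuid.UUID(bytes=...) only reformats the 128 bits. It does not set the
    # RFC-4122 version or variant fields, so the result is a UUID-shaped
    # MD5 value rather than a standards-conforming UUID.
    return str(uuid.UUID(bytes=digest))


def md5_name_uuid(name: str) -> str:
    """Return an RFC-4122 version 3 (name-based, MD5) UUID for `name`.

    Using the URL namespace here is an assumption for illustration.
    """
    return str(uuid.uuid3(uuid.NAMESPACE_URL, name))


if __name__ == "__main__":
    print(md5_uuid_string(b"contents of some document"))
    print(md5_name_uuid("http://wiki.apache.org/solr/UniqueKey"))
```

The first function keeps every bit of the MD5 digest and only adds the hyphens. The second produces a conforming version 3 UUID, at the cost of overwriting a few hash bits with the version and variant fields.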
   One advantage of using a crypto-generated unique key is that you can select 
a random subset of documents via wildcards. If the UUID data is saved as a 
string in the 32-character RFC format, 'd3adbe3fdeadb3e4deadbee4deadb3ef', the 
query "id:a*" will select a pseudo-random 1/16 of the entire document set, and 
"id:aa*" selects a pseudo-random 1/256 of it. Statistical analysis 
and data extraction projects can use this to select small subsets instead of 
walking the entire index.
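The sampling effect described above can be checked offline with a short Python sketch. The corpus is synthetic, and the helper names `doc_key` and `prefix_query` and the field name `id` are assumptions for illustration. No Solr server is contacted; the prefix test is simulated with `str.startswith`.

```python
import hashlib


def doc_key(content: str) -> str:
    """Return the 32-character hexadecimal MD5 of the content, used as the key."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def prefix_query(prefix: str, field: str = "id") -> str:
    """Build a prefix query string such as 'id:a*' (the field name is assumed)."""
    return f"{field}:{prefix}*"


# Hypothetical corpus: 16,000 synthetic documents.
keys = [doc_key(f"document number {i}") for i in range(16000)]

# A one-character hex prefix should match roughly 1/16 of the keys,
# and a two-character prefix roughly 1/256.
sample_a = [k for k in keys if k.startswith("a")]
sample_aa = [k for k in keys if k.startswith("aa")]

print(prefix_query("a"), len(sample_a) / len(keys))
print(prefix_query("aa"), len(sample_aa) / len(keys))
```

The printed fractions should land near 0.0625 and 0.0039, with some statistical scatter.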
  
