Hi.
I'm coming back to an already much-debated subject, with a few questions
I couldn't find answers to.
I started working on a new system backed by CouchDB, and am questioning
our choice to use "meaningful"/structured IDs (as opposed to UUIDs). Our
data revolves around documents called "cases", which can relate to
various documents, like notes, findings, measures. So we build IDs
looking like:
- 1234_case
- 1234_finding_f2ac2351
- 1234_finding_aa928399
- 1234_note_22933cf5
- 1234_measure_928dca87
Colleagues say they initially went for UUIDs, then moved on to a
meaningful scheme for guess-ability, which enabled easier replication,
as well as a few views referencing IDs (thanks to knowledge of the
naming structure), which expand to full documents with include_docs=true.
On my side, as a NoSQL freshman and without the project history, I can't
help wanting to move back to UUIDs, because:
1. As we're leaning heavily on the *naming* of our documents, I have the
feeling we're hiding from ourselves that we're not structuring our data
in a view-friendly way. It feels like it's going to come back and
bite us later on.
2. As we are adding logic, we're starting to see unwieldy IDs
(hash1_thing1_hash2_thing2_hash3_thing3_hash4)
3. The information contained in the ID (in the above example:
caseId, type, hash) currently lives *only* there. So to "extract" this
information we have repetitive-but-slightly-different "splitId"
functions that parse these IDs into typed fields (for example:
"1234_finding_f2ac2351" -> {"caseId": 1234, "type": "finding",
"contentId": "f2ac2351"}), which is painful.
3.1. The obvious solution would be to repeat {caseId, type, hash} as
document properties. Then I can use them without having to call
splitId(doc._id). But then there's duplicated data, which will have to
be updated jointly. Is that a problem, or is it just time for me to
learn to stop worrying and not care about this kind of minor duplication
in NoSQL land?
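
To make points 3 and 3.1 concrete, here is a minimal Python sketch of what I have in mind: one generic parser instead of several near-identical "splitId" functions, plus a builder that derives both the _id and the duplicated properties from the same inputs so they cannot drift apart at creation time. The names split_id, build_doc and ID_PATTERN are hypothetical, and the pattern assumes IDs shaped like the examples above (numeric case ID, lowercase type, optional hex suffix).

```python
import re

# Assumed ID shape: "<caseId>_<type>" or "<caseId>_<type>_<contentId>".
ID_PATTERN = re.compile(
    r"(?P<caseId>\d+)_(?P<type>[a-z]+)(?:_(?P<contentId>[0-9a-f]+))?"
)


def split_id(doc_id):
    """Parse a structured ID into typed parts; raise ValueError if malformed."""
    match = ID_PATTERN.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"unrecognised ID: {doc_id!r}")
    parts = match.groupdict()
    parts["caseId"] = int(parts["caseId"])  # caseId is numeric in our scheme
    return parts


def build_doc(case_id, doc_type, content_id=None, **fields):
    """Build a document whose _id and duplicated properties share one source."""
    doc_id = f"{case_id}_{doc_type}"
    if content_id:
        doc_id += f"_{content_id}"
    doc = {"_id": doc_id, "caseId": case_id, "type": doc_type, **fields}
    if content_id:
        doc["contentId"] = content_id
    return doc
```

For example, build_doc(1234, "note", "22933cf5", text="hello") gives a document whose caseId and type properties agree with what split_id reads back from its _id. As far as I understand, a document's _id cannot be changed once it is created, so the duplicated properties would mostly need to be kept in sync at creation time.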
Then, looking at what the internet says (see references below),
a. Both [PDB] and [DC] say non-UUID IDs are convenient for bare-bones
_all_docs querying (e.g. for "all of Bob Dylan's albums released between
1964 and 1965", just {startkey: 'album_dylan_1964_', endkey:
'album_dylan_1965_\uffff'}).
True, but how often will I be able to use such simple queries? I feel
like I'm going to need views anyway.
b. Both [PDB] and [DC] say that a structured ID naming scheme gives you
usable indexes "for free", taking no additional space compared to a
solution with random UUIDs complemented with views.
- Also, both note that using UUIDs (thus, needing views) means
failing to use the built-anyway index on _id. True.
- [DC] goes as far as saying that "getting rid of as many views
(relying on _all_docs instead) as you can is a worthwhile goal". Is this
a shared opinion?
c. [INOI] and [GUIDE] note that incremental IDs will yield better
performance on bulk document inserts. Okay.
d. [SO] proposes to "use UUIDs unless you have a good reason not to",
and recommends basing your choice on "Cost of changing ID vs. How
likely the ID is to change" (if the ID is likely to change a lot, use a
UUID to force yourself not to rely on it).
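
To illustrate the kind of ID-prefix range query discussed in (a) and (b), here is a small self-contained Python simulation. It does not talk to CouchDB: it keeps a sorted list of IDs and selects a startkey/endkey range, which is roughly what I understand an ascending _all_docs range to do. The helper names id_range and docs_of_type are hypothetical, the "1235_..." ID is made up for the example, and I'm assuming plain Python string comparison is close enough to how _all_docs orders these ASCII IDs.

```python
from bisect import bisect_left, bisect_right


def id_range(sorted_ids, startkey, endkey):
    """Return the IDs with startkey <= id <= endkey (ascending order)."""
    lo = bisect_left(sorted_ids, startkey)
    hi = bisect_right(sorted_ids, endkey)
    return sorted_ids[lo:hi]


def docs_of_type(sorted_ids, case_id, doc_type):
    """All IDs of one type for one case, using the prefix + '\\uffff' trick."""
    prefix = f"{case_id}_{doc_type}_"
    return id_range(sorted_ids, prefix, prefix + "\uffff")


# IDs from the examples above, plus one invented ID from another case.
ids = sorted([
    "1234_case",
    "1234_finding_f2ac2351",
    "1234_finding_aa928399",
    "1234_note_22933cf5",
    "1234_measure_928dca87",
    "1235_finding_00000000",  # hypothetical, to show it is excluded
])

findings = docs_of_type(ids, 1234, "finding")
whole_case = id_range(ids, "1234_", "1234_\uffff")
```

Here findings holds the two 1234 findings and whole_case holds the five documents of case 1234, with no view involved. The limitation is the one I raise in (a): this only answers questions that follow the left-to-right order of the ID parts, so "all findings across all cases" would still need a view.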
What do you think? What do you use in your own projects?
Thanks for your help, thanks for CouchDB, and happy end-of-year :)
References ----
[PDB] (section "Use and abuse your doc IDs")
http://pouchdb.com/2014/05/01/secondary-indexes-have-landed-in-pouchdb.html
[DC]
http://davidcaylor.com/2012/05/26/can-i-see-your-id-please-the-importance-of-couchdb-record-ids/
[GUIDE] http://guide.couchdb.org/draft/performance.html#bulk
[INOI]
http://blog.inoi.fi/2010/11/impact-of-document-ids-on-performance.html
[SO]
http://stackoverflow.com/questions/1963632/what-is-best-practice-when-creating-document-ids-in-couchdb/1964947#1964947
--
Ronan