Hi.

I'm coming back on an already much debated subject, with a few questions I couldn't find answers for.

I started working on a new system backed by CouchDB, and am questioning our choice to use "meaningful"/structured IDs (as opposed to UUIDs). Our data revolves around documents called "cases", which can relate to various documents, like notes, findings, measures. So we build IDs looking like:
 - 1234_case
 - 1234_finding_f2ac2351
 - 1234_finding_aa928399
 - 1234_note_22933cf5
 - 1234_measure_928dca87

Colleagues say they initially went for UUIDs, then moved on to a meaningful scheme for guess-ability, which enabled easier replication, as well as a few views referencing IDs (thanks to knowledge of the naming structure), which expand to full documents with include_docs=true.

On my side, as a NoSQL freshman and without the project history, I can't help wanting to move back to UUIDs, because:

1. As we're leaning heavily on the *naming* of our documents, I have the feeling we're hiding ourselves we're not properly structuring our data in a way that is view-friendly. Feels like it's going to come back and bite us later on.

2. As we are adding logic, we're starting to see unwieldy IDs (hash1_thing1_hash2_thing2_hash3_thing3_hash4)

3. Currently, the information contained in the ID (in the above example: caseId, type, hash) is currently *only* here. So to "extract" this information we have repetitive-but-slightly-different "splitId" functions that extract and type these ids (for example: "1234_finding_f2ac2351" -> {"caseId": 1234, "type": "finding", "contentId": "f2ac2351"}, which is painful.

3.1. The obvious solution is be to repeat {caseId, type, hash} as document properties. Then I can use them without having to call splitId(doc._id). But then there's duplicated data, which will have to be updated jointly. Is it a problem or is it just the time for me to learn to stop worrying and not care about this kind of minor duplication in NoSQL land?

Then, looking at what the internet says (see references below),

a. Both [PDB] and [DC] say non-uuid IDs are convenient for bare-bones _all_docs querying (e.g. for "all of Bob Dylan's albums released between 1964 and 1965", just {startkey: 'album_dylan_1965_', endkey: 'album_dylan_1964_\uffff'}). True, but how often will I be able to use such simple queries? I feel like I'm going to need views anyway.

b. Both [PDB] and [DC] say that a structured ID naming means usable indexes "for free", taking no additional space compared to a solution with random UUIDs complemented with views. - Also, both note that using UUIDs (thus, needing views) means failing to use the built-anyway index on _id. True. - [DC] goes as far as saying that "getting rid of as many views (relying on _all_docs instead) as you can is a worthwhile goal". Is this a shared opinion?

c. [INOI] and [GUIDE] note that incremental IDs will yield better performance on bulk document inserts. Okay.

d. [SO] proposes to "use UUIDs unless you have a good reason not to", and recommends to base your choice on "Cost of changing ID vs. How likely the ID is to change" (if the ID is likely to change a lot, use a UUID to force yourself to not rely on it).

What do you think? What do you use in your own projects?

Thanks for your help, thanks for CouchDB, and happy end-of-year :)

References ----

[PDB] (section "Use and abuse your doc IDs") http://pouchdb.com/2014/05/01/secondary-indexes-have-landed-in-pouchdb.html

[DC] http://davidcaylor.com/2012/05/26/can-i-see-your-id-please-the-importance-of-couchdb-record-ids/

[GUIDE] http://guide.couchdb.org/draft/performance.html#bulk

[INOI] http://blog.inoi.fi/2010/11/impact-of-document-ids-on-performance.html

[SO] http://stackoverflow.com/questions/1963632/what-is-best-practice-when-creating-document-ids-in-couchdb/1964947#1964947

--
Ronan

Reply via email to