Hi

Thanks to you both for the answers so far. Indeed, my setup is far more complex than I have described to date, but I'm breaking it into bite-sized chunks around the Use Cases that I find most challenging.
Although your answers were useful, they don't quite hit the mark, and that's probably because I didn't explain my problem well enough to start with!

The database will contain entries from multiple lists (many thousands perhaps), so the _id will never be unique on a telephone number. Perhaps this might work though:

GET /database/<list _id>#0123456789

or I could just keep the _id as a uuid and move this problem (find by list id and number) to the view.

The view by list won't work for me. I need to be able to query the view with something like:

GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>

In fact, in some cases the problem is more complex than this, as I need to search for the "widest match":

GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>&min_width=5

which would return the widest match in:

0123456789
012345678
01234567
0123456
012345
01234

I even have another use case where I need to do a STARTS_WITH, e.g. provide a key of 01234 and return true if there are any numbers that start with 01234.

This is a typical telecom problem and it would be good to document a Design Pattern for this Use Case. In fact, there's a discussion for another day on how/where we could document these patterns and get peer reviews on them.

Thanks again

John

On 24 Jul 2010, at 19:15, J Chris Anderson wrote:

>
> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
>
>> Hello,
>>
>> 1/ it's a little hard to answer this question, your setup is certainly a
>> little more complex than what you expose in your email :-) However thousands
>> of documents are gracefully handled by CouchDB.
>>
>> 2/ At first sight your documents will look like:
>> { "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn",
>> "operatorId": 1234 }
>>
>> That way you can query your document by phone number:
>>
>> GET /database/0123456789
>>
>> and have all documents belonging to the list "mylist" by creating a view
>> that emits the "list" field:
>>
>> function (doc) {
>>   if (doc.list && doc.type == "NP") {
>>     emit(doc.list, null);
>>   }
>> }
>>
>> and fetching them with something like:
>>
>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>
>> 3/ When updating a document: the document is of course immediately
>> available. However the view index won't be updated. In CouchDB view indexes
>> are rebuilt on view query (not on document update). When you query
>> CouchDB "give me all the documents of the view NP", Couch will take all
>> documents that have changed (added, updated, deleted) since the last time
>> you asked Couch for the view, and will update indexes accordingly. You have
>> the option of fetching the view without rebuilding the index, with the
>> "stale" parameter, but in this case, of course, you won't see the changes.
>> During the rebuild of the index, subsequent view queries are queued until
>> the index is up to date.
>>
>> 4/ I set up CouchDB to parse network logs. A view took something like 25
>> minutes for 100 million documents, on a Dell PowerEdge 2950 Xen Virtual
>> Machine with two dedicated processors and 4 GB of RAM. Numbers can vary
>> heavily according to the complexity of the view, so it's always hard (and
>> dangerous) to give numbers. Moreover my indexes were not only numbers, but
>> also strings.
>>
>
> this is a good response. I'd only follow up to say that there are some
> techniques you can use to further tune view-generation performance. one:
> keysize and entropy can make a big difference. the view by list, as above,
> looks pretty good on that front.
>
> CouchDB can also be configured to store view indexes on a separate disk from
> the database file, which can reduce IO contention if you are at the edge of
> what your hardware can do.
>
> Also, there is the option to query views with stale=ok, which will return a
> result based on the latest snapshot, with low latency, so clients aren't
> blocked waiting for generation to complete. then you can use a cron-job with
> a regular view query and limit=1 to keep the index up to date. so clients
> always see a fairly recent snapshot, with low latency.
>
>>
>> What you should be aware of is that CouchDB requires maintenance tasks to
>> keep performance high. The task is called "compact" and should be run on
>> databases (to rebuild the db file, which is append-only) and on database
>> views (to rebuild the index file, which is append-only). During the compact,
>> the database is still available but performance is degraded (from my
>> personal experience).
>> Also, a new replication engine is in the pipe and should greatly improve the
>> replication experience.
>>
>>
>> Mickael
>>
>> ----- Original Message -----
>> From: "John" <[email protected]>
>> To: [email protected]
>> Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin /
>> Bern / Rome / Stockholm / Vienna
>> Subject: Large lists of data
>>
>> Hi
>>
>> I'm currently evaluating couchdb as a candidate to replace the relational
>> databases as used in our Telecom Applications.
>> For most of our data I can see a good fit and we already expose our service
>> provisioning as json over REST so we're well positioned for a migration.
>> One area that concerns me though is whether this technology is suitable for
>> our list data. An example of this is Mobile Number Portability where we have
>> millions of rows of data representing ported numbers with some attributes
>> against each.
>>
>> We use the standard Relational approach to this and have an entries table
>> that has a foreign key reference to a parent list.
>>
>> On our web services we do something like this:
>>
>> Create a List:
>>
>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>> { "type": "NP" }
>>
>> To add a row to a list:
>>
>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>> { "status": "portedIn", "operatorId": 1234 }
>>
>> If we want to add a lot of rows we just POST a document to the list.
>>
>> The list data is used when processing calls and it requires a fast lookup on
>> the entries table which is obviously indexed.
>>
>> Anyway, I'd be interested in getting some opinions on:
>>
>> 1) Is couchdb the *right* technology for this job? (I know it can do it!)
>>
>> 2) I presume that the relationship I currently have in my relational
>> database would remain the same for couch, i.e. the entry document would ref
>> the list document, but maybe there's a better way to do this?
>>
>> 3) Number portability requires 15 min, 1 hour and daily syncs with a central
>> number portability database. This can result in bulk updates of thousands of
>> numbers. I'm concerned with how long it takes to build a couchdb index and
>> to incrementally update it when the number of changes is large
>> (adds/removes).
>> What does this mean for the availability of the number? i.e. is the entry in
>> the db but unavailable to the application because its entry in the index
>> hasn't been built yet?
>>
>> 4) Telephone numbers like btrees so the index building should be quite fast
>> and efficient I would have thought, but does someone have anything more
>> concrete in terms of how long it would take typically? I think that the
>> bottleneck is the disk i/o and therefore it may be vastly different between
>> my laptop and one of our beefy production servers, but again I'd be
>> interested in other people's experience.
>>
>> Bit of a long one so thanks if you've read it to this point! There's a lot
>> to like with couchdb (esp the replication for our use case) so I'm hoping
>> that what I've asked above is feasible!
>>
>> Thanks
>>
>> John
>>
>>
>
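
PS: for discussion, here is a rough sketch of how the "widest match" and STARTS_WITH lookups from the top of this mail might be expressed with a composite view key. Nothing here is confirmed in the thread. It assumes each entry keeps a uuid _id and has hypothetical "list" and "number" fields, and that the view emits [list, number]. The names mapByListNumber, widestMatchKeys, pickWidest and startsWithRange are made up for illustration. It also leans on the usual CouchDB convention of "\ufff0" as a high sentinel for string ranges.

```javascript
// Map function for a hypothetical view on the portability design document.
// Inside CouchDB this would be the anonymous function stored in the view;
// "emit" is supplied by the CouchDB view server, not defined here.
// Assumed doc shape: { type: "NP", list: "<list _id>", number: "0123456789" }
const mapByListNumber = function (doc) {
  if (doc.type === "NP" && doc.list && doc.number) {
    // Composite key: list id first, then the number, so one index serves all lists.
    emit([doc.list, doc.number], null);
  }
};

// Widest match: build every prefix of the dialled number, longest first,
// down to minWidth digits. The array is meant to be POSTed to the view
// as the body {"keys": [...]} (multi-key fetch).
function widestMatchKeys(listId, number, minWidth) {
  const keys = [];
  for (let len = number.length; len >= minWidth; len--) {
    keys.push([listId, number.slice(0, len)]);
  }
  return keys;
}

// Given the "rows" array of the view response, return the row whose
// number part is longest, or null when nothing matched.
function pickWidest(rows) {
  return rows.reduce(
    (best, row) =>
      best === null || row.key[1].length > best.key[1].length ? row : best,
    null
  );
}

// STARTS_WITH: a key range covering every number in the list that begins
// with the prefix. limit=1 is enough to answer true/false. The keys must be
// JSON-encoded and URL-escaped before going into the query string.
function startsWithRange(listId, prefix) {
  return {
    startkey: [listId, prefix],
    endkey: [listId, prefix + "\ufff0"],
    limit: 1,
  };
}
```

Usage would be along these lines: POST {"keys": widestMatchKeys(listId, "0123456789", 5)} to the view and pass the returned rows to pickWidest. For STARTS_WITH, GET the view with the startkey, endkey and limit from startsWithRange(listId, "01234") and treat a non-empty rows array as true. Whether one multi-key request is fast enough for call processing is something I would still need to measure.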
