Re: Large lists of data

John Sat, 24 Jul 2010 12:35:40 -0700

Hi

Thanks to you both for the answers so far. Indeed my setup is far more complex 
than I have exposed to date but I'm making it into bite sized chunks around the 
Use Cases that I think are the more challenging for me.


Although your answers were useful they don't quite hit the mark and that's 
probably because I didn't explain my problem well enough to start with!

The database will contain entries from multiple lists (many thousands perhaps) 
so the _id will never be unique on a telephone number. Perhaps this might work 
though:

GET /database/<list _id>#0123456789

or I could just keep the _id as a uuid and move this problem (find by list id 
and number) to the view.

The view by list wont work for me. I need to be able to query the view with 
something like:

GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>

In fact in some cases the problem is more complex than this as I need to search 
for "widest match":

GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
list>&min_width=5

which would return the widest match in:

0123456789
012345678
01234567
0123456
012345
01234


I even have another use case where I need to do a STARTS_WITH e.g. provide a 
key of 01234 and return true if there are any numbers that start 01234.

This is a typical telecom problem and it would be good to document a Design 
Pattern for this Use Case. In fact there's a discussion for another day on 
how/where we could document this patterns and get peer reviews on them.

Thanks again

John

On 24 Jul 2010, at 19:15, J Chris Anderson wrote:

> 
> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
> 
>> Hello,
>> 
>> 1/ it's a little hard to answer this question, your setup is certainly a 
>> little more complex than what you expose in your email :-) However thousands 
>> of documents are gracefuly handled by CouchDB.
>> 
>> 2/ At first sight your documents will look like :
>> { "_id": 0123456789 , "list": "mylist", "type": "NP", "status":"portedIn", 
>> "operatorId":1234 }
>> 
>> That way you can query your document by phone number :
>> 
>> GET /database/0123456789
>> 
>> and have all documents belonging to the list "mylist" by creating a view 
>> that emits the "list" field :
>> 
>> function (doc) {
>> if ( doc.list  && doc.type == "NP" ) {
>>   emit (doc.list,null);
>> }
>> }
>> 
>> and fetching them with something like :
>> 
>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>> 
>> 3/ When updating a document : the document is of course immediately 
>> available. However the view index won't be updated. In CouchDB view indexes 
>> are rebuilt on view query (not on document update). When you'll query 
>> CouchDB "give me all the documents of the view NP", Couch will take all 
>> documents that have changed (added, updated, deleted) since the last time 
>> you asked Couch for the view, and will update indexes accordingly. You have 
>> the option of fetching the view without rebuilding the index, with the 
>> "stale" parameter, but in this case, of course, you won't see the changes. 
>> During the rebuilt of the index, subsequent view queries are queued until 
>> the index is up to date.
>> 
>> 4/ I setup CouchDB to parse network logs. A view took something like 25 
>> minuts for 100 millions documents, on a Dell PowerEdge 2950 Xen Virtual 
>> Machine with two dedicated processors and 4gigs ram. Numbers can heavily 
>> vary according to the complexity of the view, so it's always hard (and 
>> dangerous) to give numbers. Moreover my indexes were not only numbers, but 
>> also strings.
>> 
> 
> this is a good response. I'd only follow up to say that there are some 
> techniques you can use to further tune view-generation performance. one: 
> keysize and entropy can make a big difference. the view by list, as above, 
> looks pretty good on that front.
> 
> CouchDB can also be configured to store view indexes on a separate disk from 
> the database file, which can reduce IO contention if you are at the edge of 
> what your hardware can do.
> 
> Also, there is the option to query views with stale=ok, which will return a 
> query based on the latest snapshot, with low latency, so clients aren't 
> blocked waiting for generation to complete. then you can use a cron-job with 
> a regular view query and limit=1 to keep the index up to date. so clients 
> always see a fairly recent snapshot, with low latency.
> 
>> 
>> What you should be aware of is that CouchDB requires maintenance tasks to 
>> keep great performances, it's called "compact" and should be run on 
>> databases (to rebuilt the db file that is append-only) and on databases 
>> views (to rebuild the index file that is append-only). During the compact, 
>> database is still available but performances are degraded (from my personnal 
>> experience).
>> Also, a new replication engine is in the pipe and should greatly improve the 
>> replication experience.
>> 
>> 
>> Mickael
>> 
>> ----- Mail Original -----
>> De: "John" <[email protected]>
>> À: [email protected]
>> Envoyé: Samedi 24 Juillet 2010 11h37:56 GMT +01:00 Amsterdam / Berlin / 
>> Berne / Rome / Stockholm / Vienne
>> Objet: Large lists of data
>> 
>> Hi 
>> 
>> I'm currently evaluating couchdb as a candidate to replace the relational 
>> databases as used in our Telecom Applications.
>> For most of our data I can see a good fit and we already expose our service 
>> provisioning as json over REST so we're well positioned for a migration.
>> One area that concerns me though is whether this technology is suitable for 
>> our list data. An example of this is Mobile Number Portability where we have 
>> millions of rows of data representing ported numbers with some atrributes 
>> against each.
>> 
>> We use the standard Relational approach to this and have an entries table 
>> that has a foreign key reference to a parent list. 
>> 
>> On our web services we do something like this:
>> 
>> Create a List:
>> 
>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>> { "type": "NP"}
>> 
>> To add a row to a list 
>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>> { "status":"portedIn", "operatorId":1234}
>> 
>> If we want to add a lot of rows we just POST a document to the list.
>> 
>> The list data is used when processing calls and it requires a fast lookup on 
>> the entries table which is obviously indexed.
>> 
>> Anyway, I'd be interested in getting some opinions on:
>> 
>> 1) Is couchdb the *right* technology for this job? (I know it can do it!)
>> 
>> 2) I presume that the relationship I currently have in my relational 
>> database would remain the same for couch i.e. The entry document would ref 
>> the list document but maybe there's a better way to do this?
>> 
>> 3) Number portability requires 15 min, 1 hour and daily syncs with a central 
>> number portability database. This can result in bulk updates of thousands of 
>> numbers. I'm concerned with how long it takes to build a couchdb index and 
>> to incrementally update it when the number of changes is large 
>> (Adds/removes).  
>> What does this mean to the availability of the number? i.e. Is the entry in 
>> the db but its unavailable to the application as it's entry in the index 
>> hasnt been built yet?
>> 
>> 4) Telephone numbers like btrees so the index building should be quite fast 
>> and efficient I would of thought but does someone have anything more 
>> concrete in terms of how long it would take typically? I think that the 
>> bottleneck is the disk i/o and therefore it may be vastly different between 
>> my laptop and one of our beefy production servers but again I'd be 
>> interested in other peoples experience.
>> 
>> Bit of a long one so thanks if you've read it to this point! There's a lot 
>> to like with couchdb (esp the replication for our use case) so I'm hoping 
>> that what i've asked above is feasible!
>> 
>> Thanks
>> 
>> John
>> 
>> 
>

Re: Large lists of data

Reply via email to