Re: Large lists of data

mickael . bailly Sat, 24 Jul 2010 07:42:34 -0700

Hello,

1/ It's a little hard to answer this question; your setup is certainly a little 
more complex than what you describe in your email :-) However, thousands of 
documents are gracefully handled by CouchDB.


2/ At first glance, your documents could look like this (the "_id" must be a 
string, which also preserves the leading zero):
{ "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn", 
"operatorId": 1234 }
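
To make the mapping concrete, here is a minimal sketch of how one entry from John's REST API could be turned into such a document, and how a batch of them could be wrapped for CouchDB's bulk endpoint (POST /database/_bulk_docs). The helper names toEntryDoc and toBulkBody are made up for illustration; the field names come from the example document above.

```javascript
// Hypothetical helper: build one CouchDB document per ported number.
// The phone number is stored as a string so the leading zero survives.
function toEntryDoc(list, number, attrs) {
  return {
    _id: String(number),
    list: list,
    type: "NP",
    status: attrs.status,
    operatorId: attrs.operatorId
  };
}

// Hypothetical helper: wrap documents in the {"docs": [...]} body
// that POST /database/_bulk_docs expects.
function toBulkBody(docs) {
  return JSON.stringify({ docs: docs });
}

const doc = toEntryDoc("mylist", "0123456789", { status: "portedIn", operatorId: 1234 });
console.log(toBulkBody([doc]));
```

Sending many documents in one _bulk_docs request is the usual way to load thousands of numbers at once instead of issuing one PUT per number.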

That way you can fetch a document directly by phone number:

GET /database/0123456789

and get all documents belonging to the list "mylist" by creating a view that 
emits the "list" field:

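function (doc) {
  // Index only number-portability entries that belong to a list
  if (doc.list && doc.type == "NP") {
    // Key: the list name. Value: null (documents come via include_docs)
    emit(doc.list, null);
  }
}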

and fetching them with something like:

GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
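
As a rough illustration of what the view and the query above do, the sketch below runs the same map logic over a few sample documents outside CouchDB and builds the query URL. Everything except the map logic is an assumption for the sake of the example: emit is passed in as a parameter so the code runs standalone, buildRows and viewUrl are made-up names, and the sample documents are invented.

```javascript
// The map logic from the message; `emit` is passed in here only so the
// sketch can run outside CouchDB.
function mapNP(doc, emit) {
  if (doc.list && doc.type == "NP") {
    emit(doc.list, null);
  }
}

// Toy stand-in for the view: run the map function over every document
// and collect rows of {id, key, value}.
function buildRows(docs) {
  const rows = [];
  for (const doc of docs) {
    mapNP(doc, (key, value) => rows.push({ id: doc._id, key: key, value: value }));
  }
  return rows;
}

// Build the query URL; the key is JSON-encoded (hence the double quotes),
// then URL-encoded.
function viewUrl(db, designDoc, view, key) {
  return "/" + db + "/_design/" + designDoc + "/_view/" + view +
    "?key=" + encodeURIComponent(JSON.stringify(key)) + "&include_docs=true";
}

// Invented sample data, for illustration only.
const sampleDocs = [
  { _id: "0123456789", list: "mylist", type: "NP", status: "portedIn", operatorId: 1234 },
  { _id: "0987654321", list: "otherlist", type: "NP", status: "portedIn", operatorId: 5678 },
  { _id: "note-1", type: "comment" }
];

const matching = buildRows(sampleDocs).filter((row) => row.key === "mylist");
console.log(matching.map((row) => row.id));
console.log(viewUrl("database", "portability", "NP", "mylist"));
```

Only documents that have a "list" field and type "NP" produce a row, so the third sample document never appears in the view.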

3/ When you update a document, the document itself is of course immediately 
available. However, the view index won't be updated yet. In CouchDB, view indexes 
are updated on view query (not on document update). When you ask CouchDB "give 
me all the documents of the view NP", Couch takes all documents that have changed 
(added, updated, deleted) since the last time you queried the view, and 
updates the index accordingly. You have the option of fetching the view 
without updating the index, with the "stale=ok" parameter, but in this case, of 
course, you won't see the changes. While the index is being updated, subsequent 
view queries are queued until the index is up to date.
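
The behaviour described in point 3 can be pictured with a toy model. This is only a sketch of the idea (writes are visible at once, the index catches up when queried), not CouchDB's actual implementation; the class ToyStore and its methods are invented for illustration, and deletions are left out to keep it short.

```javascript
// Toy model only: a store with a change log and a view index that
// catches up lazily, at query time.
class ToyStore {
  constructor() {
    this.docs = new Map();   // _id -> document, visible immediately
    this.changes = [];       // ids in the order they were written
    this.index = new Map();  // _id -> emitted key (the list name)
    this.indexedUpTo = 0;    // number of changes already indexed
  }

  put(doc) {
    this.docs.set(doc._id, doc);
    this.changes.push(doc._id); // the index is NOT touched here
  }

  // Query the view; unless `stale` is true, process pending changes first.
  queryByList(list, stale) {
    if (!stale) {
      for (; this.indexedUpTo < this.changes.length; this.indexedUpTo++) {
        const doc = this.docs.get(this.changes[this.indexedUpTo]);
        this.index.set(doc._id, doc.list);
      }
    }
    const ids = [];
    for (const [id, key] of this.index) {
      if (key === list) ids.push(id);
    }
    return ids;
  }
}

const store = new ToyStore();
store.put({ _id: "0123456789", list: "mylist", type: "NP" });
const direct = store.docs.get("0123456789");           // available right away
const staleRows = store.queryByList("mylist", true);   // index not updated yet
const freshRows = store.queryByList("mylist", false);  // catches up, then answers
console.log(direct !== undefined, staleRows.length, freshRows.length);
```

In other words, a direct GET by phone number never waits for the view; only view queries depend on the index being up to date.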

4/ I set up CouchDB to parse network logs. Building a view took something like 
25 minutes for 100 million documents, on a Xen virtual machine on a Dell 
PowerEdge 2950 with two dedicated processors and 4 GB of RAM. Numbers can vary 
heavily according to the complexity of the view, so it's always hard (and 
dangerous) to give numbers. Moreover, my index keys were not only numbers, but 
also strings.


What you should be aware of is that CouchDB requires a maintenance task to keep 
good performance. It's called "compaction" and should be run on databases (to 
rewrite the db file, which is append-only) and on database views (to rewrite the 
index file, which is append-only). During compaction, the database is still 
available but performance is degraded (from my personal experience).
Also, a new replication engine is in the pipeline and should greatly improve the 
replication experience.
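
For the compaction tasks mentioned above, CouchDB exposes two HTTP endpoints: POST /{db}/_compact for the database file and POST /{db}/_compact/{design-doc} for the views of one design document. The sketch below only describes the two requests as plain objects and sends nothing; the function name compactionRequests is made up, the database and design document names are the ones used earlier in this message, and the Content-Type header follows what current CouchDB documentation asks for.

```javascript
// Describe the two compaction requests as plain objects; nothing is sent.
//   POST /{db}/_compact         compacts the database file
//   POST /{db}/_compact/{ddoc}  compacts the view indexes of one design document
function compactionRequests(db, designDoc) {
  const headers = { "Content-Type": "application/json" };
  return [
    { method: "POST", path: "/" + db + "/_compact", headers: headers },
    { method: "POST", path: "/" + db + "/_compact/" + designDoc, headers: headers }
  ];
}

for (const req of compactionRequests("database", "portability")) {
  console.log(req.method + " " + req.path);
}
```

Since performance is degraded while compaction runs, it is probably best scheduled outside the windows where the 15 minute, hourly and daily syncs arrive.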


Mickael

----- Original Message -----
From: "John" <[email protected]>
To: [email protected]
Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / Bern / 
Rome / Stockholm / Vienna
Subject: Large lists of data

Hi 

I'm currently evaluating couchdb as a candidate to replace the relational 
databases as used in our Telecom Applications.
For most of our data I can see a good fit and we already expose our service 
provisioning as json over REST so we're well positioned for a migration.
One area that concerns me though is whether this technology is suitable for our 
list data. An example of this is Mobile Number Portability where we have 
millions of rows of data representing ported numbers with some attributes 
against each.

We use the standard Relational approach to this and have an entries table that 
has a foreign key reference to a parent list. 

On our web services we do something like this:

Create a List:

PUT /cie-rest/provision/accounts/netdev/lists/mylist
{ "type": "NP"}

To add a row to a list 
PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
{ "status":"portedIn", "operatorId":1234}

If we want to add a lot of rows we just POST a document to the list.

The list data is used when processing calls and it requires a fast lookup on 
the entries table which is obviously indexed.

Anyway, I'd be interested in getting some opinions on:

1) Is couchdb the *right* technology for this job? (I know it can do it!)

2) I presume that the relationship I currently have in my relational database 
would remain the same for couch i.e. The entry document would ref the list 
document but maybe there's a better way to do this?

3) Number portability requires 15 min, 1 hour and daily syncs with a central 
number portability database. This can result in bulk updates of thousands of 
numbers. I'm concerned with how long it takes to build a couchdb index and to 
incrementally update it when the number of changes is large (adds/removes).  
What does this mean for the availability of the number? i.e. Is the entry in the 
db but unavailable to the application because its entry in the index hasn't been 
built yet?

4) Telephone numbers like btrees so the index building should be quite fast and 
efficient I would have thought, but does someone have anything more concrete in 
terms of how long it would typically take? I think that the bottleneck is the 
disk i/o and therefore it may be vastly different between my laptop and one of 
our beefy production servers, but again I'd be interested in other people's 
experience.

Bit of a long one so thanks if you've read it to this point! There's a lot to 
like with couchdb (esp the replication for our use case) so I'm hoping that 
what I've asked above is feasible!

Thanks

John
