Re: Large lists of data

J Chris Anderson Sat, 24 Jul 2010 13:20:45 -0700

On Jul 24, 2010, at 1:13 PM, John wrote:

> I'm not sure if that will give me what I want exactly.
> 
> I want to search for these exact numbers only nothing in-between
> 
>>> 0123456789
>>> 012345678
>>> 01234567
>>> 0123456
>>> 012345
>>> 01234
> 
> Therefore the most documents it could possibly return is 6 i.e. even though 
> startKey is "01234" and endkey is "0123456789" the number "012346" is not a 
> valid match. 
>


startkey "01234" endkey "012346" with inclusive_end=false should work, except
it will also bring in e.g. "0123457".
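
To see why that extra key shows up, here is a minimal sketch of the range
check. It assumes plain JavaScript string comparison is a fair stand-in for
view collation (the two agree for digit-only strings like these); the helper
name and the sample keys are made up for illustration.

```javascript
// Sketch only: plain string comparison stands in for CouchDB's view
// collation, which orders digit-only strings the same way.
function inRange(key, startkey, endkey, inclusiveEnd) {
  if (key < startkey) return false;
  return inclusiveEnd ? key <= endkey : key < endkey;
}

// Hypothetical sample keys, not data from the real database.
const sampleKeys = ["01234", "012345", "0123457", "012346", "0123456789"];

// Same bounds as above: startkey "01234", endkey "012346", end excluded.
const matched = sampleKeys.filter(
  (key) => inRange(key, "01234", "012346", false)
);
console.log(matched);
```

"0123457" sorts before "012346" because the comparison is decided at the
sixth character ("5" < "6"), so any range wide enough to hold "0123456789"
also holds it.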

If "0123457" is not OK then I'm flummoxed as to how to describe your key 
requirements in a simple way, regardless of technology.
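
One possible reading of the requirement is an explicit list of keys rather
than a range. A minimal sketch, assuming "widest match" means every leading
prefix of the number down to a minimum width; candidateKeys and minWidth are
illustrative names, not anything from your setup.

```javascript
// Sketch only: list the exact candidate keys for a number, widest first,
// stopping at a minimum width. Names here are illustrative.
function candidateKeys(number, minWidth) {
  const keys = [];
  for (let width = number.length; width >= minWidth; width--) {
    keys.push(number.slice(0, width));
  }
  return keys;
}

const keys = candidateKeys("0123456789", 5);
console.log(keys);
```

A list like this could be sent as the keys of a multi-key view query, so only
those six exact values are looked up; treat that pairing as an assumption to
test rather than a proven recipe.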

> Hope that makes sense!
> 
> John
> 
> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
> 
>> 
>> On Jul 24, 2010, at 12:35 PM, John wrote:
>> 
>>> Hi
>>> 
>>> Thanks to you both for the answers so far. Indeed my setup is far more 
>>> complex than I have exposed to date but I'm making it into bite sized 
>>> chunks around the Use Cases that I think are the more challenging for me.
>>> 
>>> Although your answers were useful they don't quite hit the mark and that's 
>>> probably because I didn't explain my problem well enough to start with!
>>> 
>>> The database will contain entries from multiple lists (many thousands 
>>> perhaps) so the _id will never be unique on a telephone number. Perhaps 
>>> this might work though:
>>> 
>>> GET /database/<list _id>#0123456789
>>> 
>>> or I could just keep the _id as a uuid and move this problem (find by list 
>>> id and number) to the view.
>>> 
>> 
>> for now I'd say just go with uuids and you can have a view by telephone 
>> number for direct (or starts_with) lookups.
>> 
>>> The view by list won't work for me. I need to be able to query the view with 
>>> something like:
>>> 
>>> GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
>>> list>
>>> 
>>> In fact in some cases the problem is more complex than this as I need to 
>>> search for "widest match":
>>> 
>>> GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
>>> list>&min_width=5
>>> 
>>> which would return the widest match in:
>>> 
>>> 0123456789
>>> 012345678
>>> 01234567
>>> 0123456
>>> 012345
>>> 01234
>>> 
>>> 
>>> I even have another use case where I need to do a STARTS_WITH e.g. provide 
>>> a key of 01234 and return true if there are any numbers that start 01234.
>>> 
>> 
>> this is easy. have a view like:
>> 
>> function(doc) {
>> emit([doc.list_id, doc.number], null) 
>> }
>> 
>> Then you can query with 
>> 
>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>> 
>> to get everything with a prefix of "012" in the "mylist" list. you can set 
>> inclusive_end = true / false to control whether the exact number "013" 
>> shows up in your result set.
>> 
>> from this technique you can see how you could do starts-with against just 
>> phone numbers also, with a view like
>> 
>> function(doc) {
>> emit(doc.number, null)
>> }
>> 
>> Note I have telephone numbers as strings in this example as a regular number 
>> 012 is the same as 12.
>> 
>>> This is a typical telecom problem and it would be good to document a Design 
>>> Pattern for this Use Case. In fact there's a discussion for another day on 
>>> how/where we could document these patterns and get peer reviews on them.
>>> 
>>> Thanks again
>>> 
>>> John
>>> 
>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>> 
>>>> 
>>>> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> 1/ it's a little hard to answer this question, your setup is certainly a 
>>>>> little more complex than what you expose in your email :-) However 
>>>>> thousands of documents are gracefully handled by CouchDB.
>>>>> 
>>>>> 2/ At first sight your documents will look like :
>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP", 
>>>>> "status":"portedIn", "operatorId":1234 }
>>>>> 
>>>>> That way you can query your document by phone number :
>>>>> 
>>>>> GET /database/0123456789
>>>>> 
>>>>> and have all documents belonging to the list "mylist" by creating a view 
>>>>> that emits the "list" field :
>>>>> 
>>>>> function (doc) {
>>>>> if ( doc.list  && doc.type == "NP" ) {
>>>>> emit (doc.list,null);
>>>>> }
>>>>> }
>>>>> 
>>>>> and fetching them with something like :
>>>>> 
>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>> 
>>>>> 3/ When updating a document : the document is of course immediately 
>>>>> available. However the view index won't be updated. In CouchDB view 
>>>>> indexes are rebuilt on view query (not on document update). When you 
>>>>> query CouchDB "give me all the documents of the view NP", Couch will take 
>>>>> all documents that have changed (added, updated, deleted) since the last 
>>>>> time you asked Couch for the view, and will update indexes accordingly. 
>>>>> You have the option of fetching the view without rebuilding the index, 
>>>>> with the "stale" parameter, but in this case, of course, you won't see 
>>>>> the changes. During the rebuild of the index, subsequent view queries are 
>>>>> queued until the index is up to date.
>>>>> 
>>>>> 4/ I set up CouchDB to parse network logs. A view took something like 25 
>>>>> minutes for 100 million documents, on a Dell PowerEdge 2950 Xen Virtual 
>>>>> Machine with two dedicated processors and 4 GB of RAM. Numbers can heavily 
>>>>> vary according to the complexity of the view, so it's always hard (and 
>>>>> dangerous) to give numbers. Moreover my indexes were not only numbers, 
>>>>> but also strings.
>>>>> 
>>>> 
>>>> this is a good response. I'd only follow up to say that there are some 
>>>> techniques you can use to further tune view-generation performance. one: 
>>>> keysize and entropy can make a big difference. the view by list, as above, 
>>>> looks pretty good on that front.
>>>> 
>>>> CouchDB can also be configured to store view indexes on a separate disk 
>>>> from the database file, which can reduce IO contention if you are at the 
>>>> edge of what your hardware can do.
>>>> 
>>>> Also, there is the option to query views with stale=ok, which will return 
>>>> a query based on the latest snapshot, with low latency, so clients aren't 
>>>> blocked waiting for generation to complete. then you can use a cron-job 
>>>> with a regular view query and limit=1 to keep the index up to date. so 
>>>> clients always see a fairly recent snapshot, with low latency.
>>>> 
>>>>> 
>>>>> What you should be aware of is that CouchDB requires a maintenance task to 
>>>>> keep performance good. It's called "compact" and should be run on 
>>>>> databases (to rebuild the db file, which is append-only) and on database 
>>>>> views (to rebuild the index file, which is append-only). During the 
>>>>> compact, the database is still available but performance is degraded (from 
>>>>> my personal experience).
>>>>> Also, a new replication engine is in the pipe and should greatly improve 
>>>>> the replication experience.
>>>>> 
>>>>> 
>>>>> Mickael
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "John" <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / 
>>>>> Bern / Rome / Stockholm / Vienna
>>>>> Subject: Large lists of data
>>>>> 
>>>>> Hi 
>>>>> 
>>>>> I'm currently evaluating couchdb as a candidate to replace the relational 
>>>>> databases as used in our Telecom Applications.
>>>>> For most of our data I can see a good fit and we already expose our 
>>>>> service provisioning as json over REST so we're well positioned for a 
>>>>> migration.
>>>>> One area that concerns me though is whether this technology is suitable 
>>>>> for our list data. An example of this is Mobile Number Portability where 
>>>>> we have millions of rows of data representing ported numbers with some 
>>>>> attributes against each.
>>>>> 
>>>>> We use the standard Relational approach to this and have an entries table 
>>>>> that has a foreign key reference to a parent list. 
>>>>> 
>>>>> On our web services we do something like this:
>>>>> 
>>>>> Create a List:
>>>>> 
>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>> { "type": "NP"}
>>>>> 
>>>>> To add a row to a list 
>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>> { "status":"portedIn", "operatorId":1234}
>>>>> 
>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>> 
>>>>> The list data is used when processing calls and it requires a fast lookup 
>>>>> on the entries table which is obviously indexed.
>>>>> 
>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>> 
>>>>> 1) Is couchdb the *right* technology for this job? (I know it can do it!)
>>>>> 
>>>>> 2) I presume that the relationship I currently have in my relational 
>>>>> database would remain the same for couch i.e. The entry document would 
>>>>> ref the list document but maybe there's a better way to do this?
>>>>> 
>>>>> 3) Number portability requires 15 min, 1 hour and daily syncs with a 
>>>>> central number portability database. This can result in bulk updates of 
>>>>> thousands of numbers. I'm concerned with how long it takes to build a 
>>>>> couchdb index and to incrementally update it when the number of changes 
>>>>> is large (Adds/removes).  
>>>>> What does this mean for the availability of the number? i.e. Is the entry 
>>>>> in the db but unavailable to the application because its entry in the 
>>>>> index hasn't been built yet?
>>>>> 
>>>>> 4) Telephone numbers like btrees so the index building should be quite 
>>>>> fast and efficient I would have thought, but does someone have anything more 
>>>>> concrete in terms of how long it would take typically? I think that the 
>>>>> bottleneck is the disk i/o and therefore it may be vastly different 
>>>>> between my laptop and one of our beefy production servers but again I'd 
>>>>> be interested in other people's experience.
>>>>> 
>>>>> Bit of a long one so thanks if you've read it to this point! There's a 
>>>>> lot to like with couchdb (esp the replication for our use case) so I'm 
>>>>> hoping that what I've asked above is feasible!
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> John
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>
