In SQL it would be something like:

SELECT * FROM entries e
WHERE e.list_id = 1234
AND e.key IN ('0123456789', '012345678', '01234567', '0123456', '012345', '01234')
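Translated to CouchDB (a sketch, assuming the [list_id, number] view suggested later in this thread; the field and view names are illustrative), the same exact-match lookup can be done by POSTing a keys array to the view, so only exact matches come back:

```javascript
// View map function (field names list_id / number are assumed
// from later in this thread, not prescribed):
function mapEntry(doc) {
  // compound key: one view serves both per-list scans and
  // exact-number lookups within a list
  emit([doc.list_id, doc.number], null);
}

// Build the body for POST /database/_design/portability/_view/NP
// -- the CouchDB equivalent of the SQL IN (...) above: the view is
// queried with an explicit list of keys, so nothing "in between"
// the listed numbers can match.
function exactMatchBody(listId, numbers) {
  return JSON.stringify({
    keys: numbers.map(function (n) { return [listId, n]; })
  });
}
```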


On 24 Jul 2010, at 21:20, J Chris Anderson <[email protected]> wrote:

> 
> On Jul 24, 2010, at 1:13 PM, John wrote:
> 
>> I'm not sure if that will give me what I want exactly.
>> 
>> I want to search for these exact numbers only, nothing in between:
>> 
>>>> 0123456789
>>>> 012345678
>>>> 01234567
>>>> 0123456
>>>> 012345
>>>> 01234
>> 
>> Therefore the most documents it could possibly return is 6, i.e. even though 
>> startkey is "01234" and endkey is "0123456789", the number "012346" is not a 
>> valid match. 
>> 
> 
> startkey="01234" endkey="012346" with inclusive_end=false should work, 
> except it will bring in e.g. "0123457".
> 
> If "0123457" is not OK then I'm flummoxed as to how to describe your key 
> requirements in a simple way, regardless of technology.
> 
>> Hope that makes sense!
>> 
>> John
>> 
>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>> 
>>> 
>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>> 
>>>> Hi
>>>> 
>>>> Thanks to you both for the answers so far. Indeed my setup is far more 
>>>> complex than I have exposed to date but I'm making it into bite sized 
>>>> chunks around the Use Cases that I think are the more challenging for me.
>>>> 
>>>> Although your answers were useful they don't quite hit the mark and that's 
>>>> probably because I didn't explain my problem well enough to start with!
>>>> 
>>>> The database will contain entries from multiple lists (many thousands 
>>>> perhaps) so the _id will never be unique on a telephone number. Perhaps 
>>>> this might work though:
>>>> 
>>>> GET /database/<list _id>#0123456789
>>>> 
>>>> or I could just keep the _id as a uuid and move this problem (find by list 
>>>> id and number) to the view.
>>>> 
>>> 
>>> for now I'd say just go with uuids and you can have a view by telephone 
>>> number for direct (or starts_with) lookups.
>>> 
>>>> The view by list won't work for me. I need to be able to query the view 
>>>> with something like:
>>>> 
>>>> GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
>>>> list>
>>>> 
>>>> In fact in some cases the problem is more complex than this as I need to 
>>>> search for "widest match":
>>>> 
>>>> GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
>>>> list>&min_width=5
>>>> 
>>>> which would return the widest match in:
>>>> 
>>>> 0123456789
>>>> 012345678
>>>> 01234567
>>>> 0123456
>>>> 012345
>>>> 01234
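A min_width parameter like this isn't something a CouchDB view supports directly. One sketch of the widest-match lookup is to generate the candidate prefixes client-side, fetch them all with a keys query (as in a multi-key view request), and keep the longest one that returned a row:

```javascript
// Generate all prefixes of a number down to minWidth characters,
// longest first, e.g. ("0123456789", 5) ->
// ["0123456789", "012345678", "01234567", "0123456", "012345", "01234"]
function candidatePrefixes(number, minWidth) {
  var prefixes = [];
  for (var len = number.length; len >= minWidth; len--) {
    prefixes.push(number.slice(0, len));
  }
  return prefixes;
}

// Given the set of keys the view actually returned rows for,
// the widest match is simply the first (longest) candidate present.
function widestMatch(number, minWidth, foundKeys) {
  var candidates = candidatePrefixes(number, minWidth);
  for (var i = 0; i < candidates.length; i++) {
    if (foundKeys.indexOf(candidates[i]) !== -1) return candidates[i];
  }
  return null;
}
```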
>>>> 
>>>> 
>>>> I even have another use case where I need to do a STARTS_WITH e.g. provide 
>>>> a key of 01234 and return true if there are any numbers that start 01234.
>>>> 
>>> 
>>> this is easy. have a view like:
>>> 
>>> function(doc) {
>>>   emit([doc.list_id, doc.number], null);
>>> }
>>> 
>>> Then you can query with 
>>> 
>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>> 
>>> to get everything with a prefix of "012" in the "mylist" list. You can set 
>>> inclusive_end=false so you don't get the exact key "013" in your result 
>>> set.
>>> 
>>> from this technique you can see how you could do starts-with against just 
>>> phone numbers also, with a view like
>>> 
>>> function(doc) {
>>>   emit(doc.number, null);
>>> }
>>> 
>>> Note that I have telephone numbers as strings in this example, since as a 
>>> regular number 012 is the same as 12.
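The prefix trick above can also be written without computing the "next" key by hand: a high Unicode sentinel such as "\ufff0" sorts after ordinary digit strings in CouchDB's collation, so it can cap the range. A small sketch, with view and field names assumed as above:

```javascript
// Build query parameters for a starts-with lookup on the
// [list_id, number] view. "\ufff0" collates after any plain
// digit string, so the range covers every key beginning with
// the prefix without also catching the incremented key ("013").
// Values must be JSON-encoded on the query string.
function prefixRange(listId, prefix) {
  return {
    startkey: JSON.stringify([listId, prefix]),
    endkey: JSON.stringify([listId, prefix + "\ufff0"])
  };
}
```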
>>> 
>>>> This is a typical telecom problem and it would be good to document a 
>>>> Design Pattern for this Use Case. In fact there's a discussion for another 
>>>> day on how/where we could document this patterns and get peer reviews on 
>>>> them.
>>>> 
>>>> Thanks again
>>>> 
>>>> John
>>>> 
>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>> 
>>>>> 
>>>>> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> 1/ It's a little hard to answer this question; your setup is certainly a 
>>>>>> little more complex than what you expose in your email :-) However, 
>>>>>> thousands of documents are gracefully handled by CouchDB.
>>>>>> 
>>>>>> 2/ At first sight your documents will look like:
>>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP", 
>>>>>> "status": "portedIn", "operatorId": 1234 }
>>>>>> 
>>>>>> That way you can query your document by phone number:
>>>>>> 
>>>>>> GET /database/0123456789
>>>>>> 
>>>>>> and have all documents belonging to the list "mylist" by creating a view 
>>>>>> that emits the "list" field:
>>>>>> 
>>>>>> function (doc) {
>>>>>>   if (doc.list && doc.type == "NP") {
>>>>>>     emit(doc.list, null);
>>>>>>   }
>>>>>> }
>>>>>> 
>>>>>> and fetching them with something like:
>>>>>> 
>>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>> 
>>>>>> 3/ When updating a document: the document is of course immediately 
>>>>>> available. However, the view index won't be updated. In CouchDB, view 
>>>>>> indexes are rebuilt on view query (not on document update). When you 
>>>>>> query CouchDB for "all the documents of the view NP", Couch will take 
>>>>>> all documents that have changed (added, updated, deleted) since the 
>>>>>> last time you asked for the view, and will update the index 
>>>>>> accordingly. You have the option of fetching the view without rebuilding 
>>>>>> the index, with the "stale" parameter, but in this case, of course, you 
>>>>>> won't see the changes. During the rebuild of the index, subsequent view 
>>>>>> queries are queued until the index is up to date.
>>>>>> 
>>>>>> 4/ I set up CouchDB to parse network logs. A view took something like 25 
>>>>>> minutes for 100 million documents, on a Dell PowerEdge 2950 Xen virtual 
>>>>>> machine with two dedicated processors and 4 GB of RAM. Numbers can vary 
>>>>>> heavily according to the complexity of the view, so it's always hard (and 
>>>>>> dangerous) to give numbers. Moreover, my indexes were not only numbers, 
>>>>>> but also strings.
>>>>>> 
>>>>> 
>>>>> this is a good response. I'd only follow up to say that there are some 
>>>>> techniques you can use to further tune view-generation performance. One: 
>>>>> key size and entropy can make a big difference, and the view by list, as 
>>>>> above, looks pretty good on that front.
>>>>> 
>>>>> CouchDB can also be configured to store view indexes on a separate disk 
>>>>> from the database file, which can reduce IO contention if you are at the 
>>>>> edge of what your hardware can do.
>>>>> 
>>>>> Also, there is the option to query views with stale=ok, which will return 
>>>>> a result based on the latest built snapshot, with low latency, so clients 
>>>>> aren't blocked waiting for generation to complete. Then you can use a cron 
>>>>> job with a regular view query and limit=1 to keep the index up to date, so 
>>>>> clients always see a fairly recent snapshot.
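A minimal sketch of that warming pattern (host, database, design document and view names here are illustrative, not from this thread):

```javascript
// Assumed view location, matching the examples elsewhere in the thread:
var base = "http://localhost:5984/database/_design/portability/_view/NP";

// Warming query, run from a cron job: a cheap request that makes
// CouchDB fold any new documents into the index before returning.
var warmUrl = base + "?limit=1";

// Application read path: stale=ok returns whatever snapshot was
// last built, without waiting for index generation. Note that
// string keys must be JSON-quoted, while stale takes a bare "ok".
var readUrl = base + '?key="mylist"&include_docs=true&stale=ok';
```

The cron job keeps the snapshot fresh; readers trade a small, bounded staleness for consistently low latency.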
>>>>> 
>>>>>> 
>>>>>> What you should be aware of is that CouchDB requires a maintenance task 
>>>>>> to keep performance high: it's called "compact" and should be run on 
>>>>>> databases (to rebuild the db file, which is append-only) and on views 
>>>>>> (to rebuild the index file, which is also append-only). During 
>>>>>> compaction, the database is still available but performance is degraded 
>>>>>> (from my personal experience).
>>>>>> Also, a new replication engine is in the pipeline and should greatly 
>>>>>> improve the replication experience.
>>>>>> 
>>>>>> 
>>>>>> Mickael
>>>>>> 
>>>>>> ----- Original Mail -----
>>>>>> From: "John" <[email protected]>
>>>>>> To: [email protected]
>>>>>> Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / 
>>>>>> Berne / Rome / Stockholm / Vienna
>>>>>> Subject: Large lists of data
>>>>>> 
>>>>>> Hi 
>>>>>> 
>>>>>> I'm currently evaluating CouchDB as a candidate to replace the 
>>>>>> relational databases used in our telecom applications.
>>>>>> For most of our data I can see a good fit, and we already expose our 
>>>>>> service provisioning as JSON over REST, so we're well positioned for a 
>>>>>> migration.
>>>>>> One area that concerns me, though, is whether this technology is suitable 
>>>>>> for our list data. An example of this is Mobile Number Portability, where 
>>>>>> we have millions of rows of data representing ported numbers with some 
>>>>>> attributes against each.
>>>>>> 
>>>>>> We use the standard relational approach to this and have an entries 
>>>>>> table with a foreign key reference to a parent list. 
>>>>>> 
>>>>>> On our web services we do something like this:
>>>>>> 
>>>>>> Create a List:
>>>>>> 
>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>> { "type": "NP"}
>>>>>> 
>>>>>> To add a row to a list 
>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>> { "status":"portedIn", "operatorId":1234}
>>>>>> 
>>>>>> If we want to add a lot of rows we just POST a document to the list.
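For comparison, the CouchDB analogue of this bulk add would be a single POST to /database/_bulk_docs. A sketch (the field names follow the NP documents discussed elsewhere in this thread and are assumptions, not prescriptions):

```javascript
// Build the body for POST /database/_bulk_docs: one request
// inserts the whole batch of entries for a list. Numbers are
// kept as strings so leading zeros survive.
function bulkDocsBody(listId, entries) {
  return JSON.stringify({
    docs: entries.map(function (e) {
      return {
        list_id: listId,
        number: e.number,
        status: e.status,
        operatorId: e.operatorId
      };
    })
  });
}
```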
>>>>>> 
>>>>>> The list data is used when processing calls and it requires a fast 
>>>>>> lookup on the entries table which is obviously indexed.
>>>>>> 
>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>> 
>>>>>> 1) Is CouchDB the *right* technology for this job? (I know it can do it!)
>>>>>> 
>>>>>> 2) I presume that the relationship I currently have in my relational 
>>>>>> database would remain the same for couch i.e. The entry document would 
>>>>>> ref the list document but maybe there's a better way to do this?
>>>>>> 
>>>>>> 3) Number portability requires 15-minute, hourly and daily syncs with a 
>>>>>> central number portability database. This can result in bulk updates of 
>>>>>> thousands of numbers. I'm concerned with how long it takes to build a 
>>>>>> CouchDB index and to incrementally update it when the number of changes 
>>>>>> is large (adds/removes).  
>>>>>> What does this mean for the availability of the number? i.e. is the entry 
>>>>>> in the db but unavailable to the application because its entry in the 
>>>>>> index hasn't been built yet?
>>>>>> 
>>>>>> 4) Telephone numbers map well onto B-trees, so the index building should 
>>>>>> be quite fast and efficient, I would have thought, but does someone have 
>>>>>> anything more concrete in terms of how long it would typically take? I 
>>>>>> think the bottleneck is disk I/O, so it may be vastly different 
>>>>>> between my laptop and one of our beefy production servers, but again I'd 
>>>>>> be interested in other people's experience.
>>>>>> 
>>>>>> Bit of a long one, so thanks if you've read it to this point! There's a 
>>>>>> lot to like with CouchDB (especially the replication for our use case) so 
>>>>>> I'm hoping that what I've asked above is feasible!
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> John
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
