Re: Large lists of data

John Sat, 24 Jul 2010 14:15:10 -0700

Thanks for the help Chris, much appreciated. I'll give it a whirl tomorrow and 
I'll put a bit more work into phrasing my problem next time!


John

On 24 Jul 2010, at 21:51, J Chris Anderson wrote:

> 
> On Jul 24, 2010, at 1:36 PM, John Logsdon wrote:
> 
>> In SQL it would be something like:
>> 
>> Select * from entries e
>> Where e.list-id=1234 
>> and key in ('0123456789', '012345678', '01234567', '0123456', '012345', 
>> '01234')
>> 
> 
> oh well in that case (you know all the keys in advance) you can do a 
> multi-key query.
> 
> we tunnel a semantic GET over an http POST, so you post
> 
> {
> "keys" : [array of keys]
> }
> 
> to the view.
> 
> For some reason I thought you were looking for a special kind of range query.
> 
> Chris
> 
>> 
>> On 24 Jul 2010, at 21:20, J Chris Anderson <[email protected]> wrote:
>> 
>>> 
>>> On Jul 24, 2010, at 1:13 PM, John wrote:
>>> 
>>>> I'm not sure if that will give me what I want exactly.
>>>> 
>>>> I want to search for these exact numbers only nothing in-between
>>>> 
>>>>>> 0123456789
>>>>>> 012345678
>>>>>> 01234567
>>>>>> 0123456
>>>>>> 012345
>>>>>> 01234
>>>> 
>>>> Therefore the most documents it could possibly return is 6 i.e. even 
>>>> though startKey is "01234" and endkey is "0123456789" the number "012346" 
>>>> is not a valid match. 
>>>> 
>>> 
>>> statykey "01234" endkey "012346" with inclusive_end = false should work, 
>>> except it will bring in eg "0123457"
>>> 
>>> If "0123457" is not OK then I'm flummoxed as to how to describe your key 
>>> requirements in a simple way, regardless of technology.
>>> 
>>>> Hope that makes sense!
>>>> 
>>>> John
>>>> 
>>>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>>>> 
>>>>> 
>>>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>>>> 
>>>>>> Hi
>>>>>> 
>>>>>> Thanks to you both for the answers so far. Indeed my setup is far more 
>>>>>> complex than I have exposed to date but I'm making it into bite sized 
>>>>>> chunks around the Use Cases that I think are the more challenging for me.
>>>>>> 
>>>>>> Although your answers were useful they don't quite hit the mark and 
>>>>>> that's probably because I didn't explain my problem well enough to start 
>>>>>> with!
>>>>>> 
>>>>>> The database will contain entries from multiple lists (many thousands 
>>>>>> perhaps) so the _id will never be unique on a telephone number. Perhaps 
>>>>>> this might work though:
>>>>>> 
>>>>>> GET /database/<list _id>#0123456789
>>>>>> 
>>>>>> or I could just keep the _id as a uuid and move this problem (find by 
>>>>>> list id and number) to the view.
>>>>>> 
>>>>> 
>>>>> for now I'd say just go with uuids and you can have a view by telephone 
>>>>> number for direct (or starts_with) lookups.
>>>>> 
>>>>>> The view by list wont work for me. I need to be able to query the view 
>>>>>> with something like:
>>>>>> 
>>>>>> GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
>>>>>> list>
>>>>>> 
>>>>>> In fact in some cases the problem is more complex than this as I need to 
>>>>>> search for "widest match":
>>>>>> 
>>>>>> GET  /database/_design/portability/_view/NP?key=0123456789&list=<_id of 
>>>>>> list>&min_width=5
>>>>>> 
>>>>>> which would return the widest match in:
>>>>>> 
>>>>>> 0123456789
>>>>>> 012345678
>>>>>> 01234567
>>>>>> 0123456
>>>>>> 012345
>>>>>> 01234
>>>>>> 
>>>>>> 
>>>>>> I even have another use case where I need to do a STARTS_WITH e.g. 
>>>>>> provide a key of 01234 and return true if there are any numbers that 
>>>>>> start 01234.
>>>>>> 
>>>>> 
>>>>> this is easy. have a view like:
>>>>> 
>>>>> function(doc) {
>>>>> emit([doc.list_id, doc.number], null) 
>>>>> }
>>>>> 
>>>>> Then you can query with 
>>>>> 
>>>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>>>> 
>>>>> to get everything with a prefix of "012" in the "mylist" list. you can 
>>>>> mess around with the endkey_inclusive (or is it inclusive_endkey) = true 
>>>>> / false to not get the exact number "013" in your result set.
>>>>> 
>>>>> from this technique you can see how you could do starts-with against just 
>>>>> phone numbers also, with a view like
>>>>> 
>>>>> function(doc_ {
>>>>> emit(doc.number, null)
>>>>> }
>>>>> 
>>>>> Note I have telephone numbers as strings in this example as a regular 
>>>>> number 012 is the same as 12.
>>>>> 
>>>>>> This is a typical telecom problem and it would be good to document a 
>>>>>> Design Pattern for this Use Case. In fact there's a discussion for 
>>>>>> another day on how/where we could document this patterns and get peer 
>>>>>> reviews on them.
>>>>>> 
>>>>>> Thanks again
>>>>>> 
>>>>>> John
>>>>>> 
>>>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> 1/ it's a little hard to answer this question, your setup is certainly 
>>>>>>>> a little more complex than what you expose in your email :-) However 
>>>>>>>> thousands of documents are gracefuly handled by CouchDB.
>>>>>>>> 
>>>>>>>> 2/ At first sight your documents will look like :
>>>>>>>> { "_id": 0123456789 , "list": "mylist", "type": "NP", 
>>>>>>>> "status":"portedIn", "operatorId":1234 }
>>>>>>>> 
>>>>>>>> That way you can query your document by phone number :
>>>>>>>> 
>>>>>>>> GET /database/0123456789
>>>>>>>> 
>>>>>>>> and have all documents belonging to the list "mylist" by creating a 
>>>>>>>> view that emits the "list" field :
>>>>>>>> 
>>>>>>>> function (doc) {
>>>>>>>> if ( doc.list  && doc.type == "NP" ) {
>>>>>>>> emit (doc.list,null);
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> and fetching them with something like :
>>>>>>>> 
>>>>>>>> GET 
>>>>>>>> /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>>>> 
>>>>>>>> 3/ When updating a document : the document is of course immediately 
>>>>>>>> available. However the view index won't be updated. In CouchDB view 
>>>>>>>> indexes are rebuilt on view query (not on document update). When 
>>>>>>>> you'll query CouchDB "give me all the documents of the view NP", Couch 
>>>>>>>> will take all documents that have changed (added, updated, deleted) 
>>>>>>>> since the last time you asked Couch for the view, and will update 
>>>>>>>> indexes accordingly. You have the option of fetching the view without 
>>>>>>>> rebuilding the index, with the "stale" parameter, but in this case, of 
>>>>>>>> course, you won't see the changes. During the rebuilt of the index, 
>>>>>>>> subsequent view queries are queued until the index is up to date.
>>>>>>>> 
>>>>>>>> 4/ I setup CouchDB to parse network logs. A view took something like 
>>>>>>>> 25 minuts for 100 millions documents, on a Dell PowerEdge 2950 Xen 
>>>>>>>> Virtual Machine with two dedicated processors and 4gigs ram. Numbers 
>>>>>>>> can heavily vary according to the complexity of the view, so it's 
>>>>>>>> always hard (and dangerous) to give numbers. Moreover my indexes were 
>>>>>>>> not only numbers, but also strings.
>>>>>>>> 
>>>>>>> 
>>>>>>> this is a good response. I'd only follow up to say that there are some 
>>>>>>> techniques you can use to further tune view-generation performance. 
>>>>>>> one: keysize and entropy can make a big difference. the view by list, 
>>>>>>> as above, looks pretty good on that front.
>>>>>>> 
>>>>>>> CouchDB can also be configured to store view indexes on a separate disk 
>>>>>>> from the database file, which can reduce IO contention if you are at 
>>>>>>> the edge of what your hardware can do.
>>>>>>> 
>>>>>>> Also, there is the option to query views with stale=ok, which will 
>>>>>>> return a query based on the latest snapshot, with low latency, so 
>>>>>>> clients aren't blocked waiting for generation to complete. then you can 
>>>>>>> use a cron-job with a regular view query and limit=1 to keep the index 
>>>>>>> up to date. so clients always see a fairly recent snapshot, with low 
>>>>>>> latency.
>>>>>>> 
>>>>>>>> 
>>>>>>>> What you should be aware of is that CouchDB requires maintenance tasks 
>>>>>>>> to keep great performances, it's called "compact" and should be run on 
>>>>>>>> databases (to rebuilt the db file that is append-only) and on 
>>>>>>>> databases views (to rebuild the index file that is append-only). 
>>>>>>>> During the compact, database is still available but performances are 
>>>>>>>> degraded (from my personnal experience).
>>>>>>>> Also, a new replication engine is in the pipe and should greatly 
>>>>>>>> improve the replication experience.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Mickael
>>>>>>>> 
>>>>>>>> ----- Mail Original -----
>>>>>>>> De: "John" <[email protected]>
>>>>>>>> À: [email protected]
>>>>>>>> Envoyé: Samedi 24 Juillet 2010 11h37:56 GMT +01:00 Amsterdam / Berlin 
>>>>>>>> / Berne / Rome / Stockholm / Vienne
>>>>>>>> Objet: Large lists of data
>>>>>>>> 
>>>>>>>> Hi 
>>>>>>>> 
>>>>>>>> I'm currently evaluating couchdb as a candidate to replace the 
>>>>>>>> relational databases as used in our Telecom Applications.
>>>>>>>> For most of our data I can see a good fit and we already expose our 
>>>>>>>> service provisioning as json over REST so we're well positioned for a 
>>>>>>>> migration.
>>>>>>>> One area that concerns me though is whether this technology is 
>>>>>>>> suitable for our list data. An example of this is Mobile Number 
>>>>>>>> Portability where we have millions of rows of data representing ported 
>>>>>>>> numbers with some atrributes against each.
>>>>>>>> 
>>>>>>>> We use the standard Relational approach to this and have an entries 
>>>>>>>> table that has a foreign key reference to a parent list. 
>>>>>>>> 
>>>>>>>> On our web services we do something like this:
>>>>>>>> 
>>>>>>>> Create a List:
>>>>>>>> 
>>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>>>> { "type": "NP"}
>>>>>>>> 
>>>>>>>> To add a row to a list 
>>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>>>> { "status":"portedIn", "operatorId":1234}
>>>>>>>> 
>>>>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>>>> 
>>>>>>>> The list data is used when processing calls and it requires a fast 
>>>>>>>> lookup on the entries table which is obviously indexed.
>>>>>>>> 
>>>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>>>> 
>>>>>>>> 1) Is couchdb the *right* technology for this job? (I know it can do 
>>>>>>>> it!)
>>>>>>>> 
>>>>>>>> 2) I presume that the relationship I currently have in my relational 
>>>>>>>> database would remain the same for couch i.e. The entry document would 
>>>>>>>> ref the list document but maybe there's a better way to do this?
>>>>>>>> 
>>>>>>>> 3) Number portability requires 15 min, 1 hour and daily syncs with a 
>>>>>>>> central number portability database. This can result in bulk updates 
>>>>>>>> of thousands of numbers. I'm concerned with how long it takes to build 
>>>>>>>> a couchdb index and to incrementally update it when the number of 
>>>>>>>> changes is large (Adds/removes).  
>>>>>>>> What does this mean to the availability of the number? i.e. Is the 
>>>>>>>> entry in the db but its unavailable to the application as it's entry 
>>>>>>>> in the index hasnt been built yet?
>>>>>>>> 
>>>>>>>> 4) Telephone numbers like btrees so the index building should be quite 
>>>>>>>> fast and efficient I would of thought but does someone have anything 
>>>>>>>> more concrete in terms of how long it would take typically? I think 
>>>>>>>> that the bottleneck is the disk i/o and therefore it may be vastly 
>>>>>>>> different between my laptop and one of our beefy production servers 
>>>>>>>> but again I'd be interested in other peoples experience.
>>>>>>>> 
>>>>>>>> Bit of a long one so thanks if you've read it to this point! There's a 
>>>>>>>> lot to like with couchdb (esp the replication for our use case) so I'm 
>>>>>>>> hoping that what i've asked above is feasible!
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> John
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>

Re: Large lists of data

Reply via email to