On Jul 24, 2010, at 1:36 PM, John Logsdon wrote:
> In SQL it would be something like:
>
> Select * from entries e
> Where e.list-id=1234
> and key in ('0123456789', '012345678', '01234567', '0123456', '012345',
> '01234')
>
oh well in that case (you know all the keys in advance) you can do a multi-key
query.
we tunnel a semantic GET over an HTTP POST, so you post
{
"keys" : [array of keys]
}
to the view.
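A minimal Python sketch of building that POST body (only the JSON body is shown; the view path and the six keys come from the earlier messages in this thread):

```python
import json

# The six exact keys from the SQL example, longest first.
candidate_keys = ["0123456789", "012345678", "01234567",
                  "0123456", "012345", "01234"]

# Body for a multi-key query, e.g.
# POST /database/_design/portability/_view/NP
body = json.dumps({"keys": candidate_keys})
print(body)
```

The response then contains one row per key that actually exists in the view, so at most six rows.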
For some reason I thought you were looking for a special kind of range query.
Chris
>
> On 24 Jul 2010, at 21:20, J Chris Anderson <[email protected]> wrote:
>
>>
>> On Jul 24, 2010, at 1:13 PM, John wrote:
>>
>>> I'm not sure if that will give me what I want exactly.
>>>
>>> I want to search for these exact numbers only, nothing in between
>>>
>>>>> 0123456789
>>>>> 012345678
>>>>> 01234567
>>>>> 0123456
>>>>> 012345
>>>>> 01234
>>>
>>> Therefore the most documents it could possibly return is 6, i.e. even though
>>> startkey is "01234" and endkey is "0123456789", the number "012346" is not a
>>> valid match.
>>>
>>
>> startkey "01234" endkey "012346" with inclusive_end = false should work,
>> except it will bring in e.g. "0123457"
>>
>> If "0123457" is not OK then I'm flummoxed as to how to describe your key
>> requirements in a simple way, regardless of technology.
>>
>>> Hope that makes sense!
>>>
>>> John
>>>
>>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>>>
>>>>
>>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Thanks to you both for the answers so far. Indeed my setup is far more
>>>>> complex than I have exposed to date but I'm breaking it into bite-sized
>>>>> chunks around the Use Cases that I think are the most challenging for me.
>>>>>
>>>>> Although your answers were useful they don't quite hit the mark and
>>>>> that's probably because I didn't explain my problem well enough to start
>>>>> with!
>>>>>
>>>>> The database will contain entries from multiple lists (many thousands
>>>>> perhaps) so the _id will never be unique on a telephone number. Perhaps
>>>>> this might work though:
>>>>>
>>>>> GET /database/<list _id>#0123456789
>>>>>
>>>>> or I could just keep the _id as a uuid and move this problem (find by
>>>>> list id and number) to the view.
>>>>>
>>>>
>>>> for now I'd say just go with uuids and you can have a view by telephone
>>>> number for direct (or starts_with) lookups.
>>>>
>>>>> The view by list won't work for me. I need to be able to query the view
>>>>> with something like:
>>>>>
>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of
>>>>> list>
>>>>>
>>>>> In fact in some cases the problem is more complex than this as I need to
>>>>> search for "widest match":
>>>>>
>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of
>>>>> list>&min_width=5
>>>>>
>>>>> which would return the widest match in:
>>>>>
>>>>> 0123456789
>>>>> 012345678
>>>>> 01234567
>>>>> 0123456
>>>>> 012345
>>>>> 01234
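One way to sketch the "widest match" lookup in Python (hypothetical helper names; `existing_keys` stands for whatever set of numbers the view query returned): generate every leading prefix of the number down to `min_width`, then take the longest one that exists.

```python
def candidate_prefixes(number, min_width):
    """All leading prefixes of `number`, longest first, down to min_width digits."""
    return [number[:width] for width in range(len(number), min_width - 1, -1)]

def widest_match(number, min_width, existing_keys):
    """Return the longest prefix of `number` present in `existing_keys`, or None."""
    for prefix in candidate_prefixes(number, min_width):
        if prefix in existing_keys:
            return prefix
    return None
```

The candidate list is exactly the set of keys you would hand to a multi-key view query, so the "widest match" reduces to picking the longest key that comes back.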
>>>>>
>>>>>
>>>>> I even have another use case where I need to do a STARTS_WITH e.g.
>>>>> provide a key of 01234 and return true if there are any numbers that
>>>>> start 01234.
>>>>>
>>>>
>>>> this is easy. have a view like:
>>>>
>>>> function(doc) {
>>>>   emit([doc.list_id, doc.number], null);
>>>> }
>>>>
>>>> Then you can query with
>>>>
>>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>>>
>>>> to get everything with a prefix of "012" in the "mylist" list. you can
>>>> set inclusive_end = true / false to control whether the exact key
>>>> ["mylist", "013"] turns up in your result set.
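The startkey/endkey pair for a prefix can be derived mechanically; a small sketch (a plain helper, not part of CouchDB itself), assuming numeric-string prefixes:

```python
def prefix_range(list_id, prefix):
    """Build a (startkey, endkey) pair selecting every key in `list_id`
    whose number starts with `prefix`, assuming inclusive_end=false.
    The endkey bumps the last character, so "012" becomes "013";
    for digit strings the bumped character still sorts correctly."""
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return [list_id, prefix], [list_id, end]
```

So `prefix_range("mylist", "012")` yields the `["mylist", "012"]` / `["mylist", "013"]` pair used in the query above.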
>>>>
>>>> from this technique you can see how you could do starts-with against just
>>>> phone numbers also, with a view like
>>>>
>>>> function(doc) {
>>>>   emit(doc.number, null);
>>>> }
>>>>
>>>> Note I have telephone numbers as strings in this example, since as a
>>>> regular number 012 is the same as 12.
>>>>
>>>>> This is a typical telecom problem and it would be good to document a
>>>>> Design Pattern for this Use Case. In fact there's a discussion for
>>>>> another day on how/where we could document these patterns and get peer
>>>>> reviews on them.
>>>>>
>>>>> Thanks again
>>>>>
>>>>> John
>>>>>
>>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>>>
>>>>>>
>>>>>> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> 1/ it's a little hard to answer this question, your setup is certainly
>>>>>>> a little more complex than what you expose in your email :-) However
>>>>>>> thousands of documents are gracefully handled by CouchDB.
>>>>>>>
>>>>>>> 2/ At first sight your documents will look like:
>>>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP",
>>>>>>> "status": "portedIn", "operatorId": 1234 }
>>>>>>>
>>>>>>> That way you can query your document by phone number:
>>>>>>>
>>>>>>> GET /database/0123456789
>>>>>>>
>>>>>>> and have all documents belonging to the list "mylist" by creating a
>>>>>>> view that emits the "list" field:
>>>>>>>
>>>>>>> function (doc) {
>>>>>>>   if ( doc.list && doc.type == "NP" ) {
>>>>>>>     emit(doc.list, null);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> and fetching them with something like:
>>>>>>>
>>>>>>> GET
>>>>>>> /database/_design/portability/_view/NP?key="mylist"&include_docs=true
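A sketch of building that URL with stdlib Python (database and view names taken from the thread; note the key must itself be JSON-encoded, quotes included, before URL-encoding):

```python
import json
from urllib.parse import urlencode

# key='"mylist"' (JSON string), include_docs=true to pull the documents back.
params = urlencode({"key": json.dumps("mylist"), "include_docs": "true"})
url = "/database/_design/portability/_view/NP?" + params
print(url)
```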
>>>>>>>
>>>>>>> 3/ When updating a document: the document is of course immediately
>>>>>>> available. However the view index won't be updated. In CouchDB, view
>>>>>>> indexes are rebuilt on view query (not on document update). When you
>>>>>>> query CouchDB "give me all the documents of the view NP", Couch will
>>>>>>> take all documents that have changed (added, updated, deleted) since
>>>>>>> the last time you asked Couch for the view, and will update the indexes
>>>>>>> accordingly. You have the option of fetching the view without
>>>>>>> rebuilding the index, with the "stale" parameter, but in this case, of
>>>>>>> course, you won't see the changes. During the rebuild of the index,
>>>>>>> subsequent view queries are queued until the index is up to date.
>>>>>>>
>>>>>>> 4/ I set up CouchDB to parse network logs. A view took something like 25
>>>>>>> minutes for 100 million documents, on a Dell PowerEdge 2950 Xen Virtual
>>>>>>> Machine with two dedicated processors and 4 GB of RAM. Numbers can
>>>>>>> vary heavily according to the complexity of the view, so it's always
>>>>>>> hard (and dangerous) to give numbers. Moreover my indexes were not only
>>>>>>> numbers, but also strings.
>>>>>>>
>>>>>>
>>>>>> this is a good response. I'd only follow up to say that there are some
>>>>>> techniques you can use to further tune view-generation performance. for
>>>>>> one, key size and entropy can make a big difference. the view by list, as
>>>>>> above, looks pretty good on that front.
>>>>>>
>>>>>> CouchDB can also be configured to store view indexes on a separate disk
>>>>>> from the database file, which can reduce IO contention if you are at the
>>>>>> edge of what your hardware can do.
>>>>>>
>>>>>> Also, there is the option to query views with stale=ok, which will
>>>>>> return results based on the latest index snapshot, with low latency, so
>>>>>> clients aren't blocked waiting for generation to complete. then you can
>>>>>> use a cron job with a regular view query and limit=1 to keep the index
>>>>>> up to date. so clients always see a fairly recent snapshot, with low
>>>>>> latency.
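The pattern splits into two URLs: clients add stale=ok to every query, and the periodic job hits the view without it to trigger an index refresh. A sketch, reusing the hypothetical view path from this thread:

```python
from urllib.parse import urlencode

view = "/database/_design/portability/_view/NP"

# What clients request: the latest built snapshot, never blocking on a rebuild.
client_url = view + "?" + urlencode({"stale": "ok", "key": '"mylist"'})

# What the cron job requests: a cheap query (limit=1) that still forces
# the index to catch up with recent document changes.
cron_url = view + "?" + urlencode({"limit": 1})
```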
>>>>>>
>>>>>>>
>>>>>>> What you should be aware of is that CouchDB requires a maintenance task
>>>>>>> to keep performance good: it's called "compact" and should be run on
>>>>>>> databases (to rebuild the db file, which is append-only) and on
>>>>>>> database views (to rebuild the index file, which is append-only).
>>>>>>> During the compact, the database is still available but performance is
>>>>>>> degraded (from my personal experience).
>>>>>>> Also, a new replication engine is in the pipeline and should greatly
>>>>>>> improve the replication experience.
>>>>>>>
>>>>>>>
>>>>>>> Mickael
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "John" <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin /
>>>>>>> Berne / Rome / Stockholm / Vienna
>>>>>>> Subject: Large lists of data
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I'm currently evaluating couchdb as a candidate to replace the
>>>>>>> relational databases as used in our Telecom Applications.
>>>>>>> For most of our data I can see a good fit and we already expose our
>>>>>>> service provisioning as json over REST so we're well positioned for a
>>>>>>> migration.
>>>>>>> One area that concerns me though is whether this technology is suitable
>>>>>>> for our list data. An example of this is Mobile Number Portability,
>>>>>>> where we have millions of rows of data representing ported numbers with
>>>>>>> some attributes against each.
>>>>>>>
>>>>>>> We use the standard Relational approach to this and have an entries
>>>>>>> table that has a foreign key reference to a parent list.
>>>>>>>
>>>>>>> On our web services we do something like this:
>>>>>>>
>>>>>>> Create a List:
>>>>>>>
>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>>> { "type": "NP"}
>>>>>>>
>>>>>>> To add a row to a list
>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>>> { "status":"portedIn", "operatorId":1234}
>>>>>>>
>>>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>>>
>>>>>>> The list data is used when processing calls and it requires a fast
>>>>>>> lookup on the entries table which is obviously indexed.
>>>>>>>
>>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>>>
>>>>>>> 1) Is couchdb the *right* technology for this job? (I know it can do
>>>>>>> it!)
>>>>>>>
>>>>>>> 2) I presume that the relationship I currently have in my relational
>>>>>>> database would remain the same for couch i.e. The entry document would
>>>>>>> ref the list document but maybe there's a better way to do this?
>>>>>>>
>>>>>>> 3) Number portability requires 15 min, 1 hour and daily syncs with a
>>>>>>> central number portability database. This can result in bulk updates of
>>>>>>> thousands of numbers. I'm concerned about how long it takes to build a
>>>>>>> couchdb index and to incrementally update it when the number of changes
>>>>>>> (adds/removes) is large.
>>>>>>> What does this mean for the availability of the number? i.e. Is the
>>>>>>> entry in the db but unavailable to the application because its entry in
>>>>>>> the index hasn't been built yet?
>>>>>>>
>>>>>>> 4) Telephone numbers suit btrees, so the index building should be quite
>>>>>>> fast and efficient I would have thought, but does someone have anything
>>>>>>> more concrete in terms of how long it would typically take? I think
>>>>>>> that the bottleneck is the disk I/O and therefore it may be vastly
>>>>>>> different between my laptop and one of our beefy production servers, but
>>>>>>> again I'd be interested in other people's experience.
>>>>>>>
>>>>>>> Bit of a long one so thanks if you've read it to this point! There's a
>>>>>>> lot to like with couchdb (esp the replication for our use case) so I'm
>>>>>>> hoping that what I've asked above is feasible!
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>