On Jul 24, 2010, at 1:13 PM, John wrote:
> I'm not sure if that will give me what I want exactly.
>
> I want to search for these exact numbers only nothing in-between
>
>>> 0123456789
>>> 012345678
>>> 01234567
>>> 0123456
>>> 012345
>>> 01234
>
> Therefore the most documents it could possibly return is 6 i.e. even though
> startKey is "01234" and endkey is "0123456789" the number "012346" is not a
> valid match.
>
statykey "01234" endkey "012346" with inclusive_end = false should work, except
it will bring in eg "0123457"
If "0123457" is not OK then I'm flummoxed as to how to describe your key
requirements in a simple way, regardless of technology.
> Hope that makes sense!
>
> John
>
> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>
>>
>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>
>>> Hi
>>>
>>> Thanks to you both for the answers so far. Indeed my setup is far more
>>> complex than I have exposed to date but I'm making it into bite sized
>>> chunks around the Use Cases that I think are the more challenging for me.
>>>
>>> Although your answers were useful they don't quite hit the mark and that's
>>> probably because I didn't explain my problem well enough to start with!
>>>
>>> The database will contain entries from multiple lists (many thousands
>>> perhaps) so the _id will never be unique on a telephone number. Perhaps
>>> this might work though:
>>>
>>> GET /database/<list _id>#0123456789
>>>
>>> or I could just keep the _id as a uuid and move this problem (find by list
>>> id and number) to the view.
>>>
>>
>> for now I'd say just go with uuids and you can have a view by telephone
>> number for direct (or starts_with) lookups.
>>
>>> The view by list wont work for me. I need to be able to query the view with
>>> something like:
>>>
>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of
>>> list>
>>>
>>> In fact in some cases the problem is more complex than this as I need to
>>> search for "widest match":
>>>
>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of
>>> list>&min_width=5
>>>
>>> which would return the widest match in:
>>>
>>> 0123456789
>>> 012345678
>>> 01234567
>>> 0123456
>>> 012345
>>> 01234
>>>
>>>
>>> I even have another use case where I need to do a STARTS_WITH e.g. provide
>>> a key of 01234 and return true if there are any numbers that start 01234.
>>>
>>
>> this is easy. have a view like:
>>
>> function(doc) {
>> emit([doc.list_id, doc.number], null)
>> }
>>
>> Then you can query with
>>
>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>
>> to get everything with a prefix of "012" in the "mylist" list. you can mess
>> around with the endkey_inclusive (or is it inclusive_endkey) = true / false
>> to not get the exact number "013" in your result set.
>>
>> from this technique you can see how you could do starts-with against just
>> phone numbers also, with a view like
>>
>> function(doc_ {
>> emit(doc.number, null)
>> }
>>
>> Note I have telephone numbers as strings in this example as a regular number
>> 012 is the same as 12.
>>
>>> This is a typical telecom problem and it would be good to document a Design
>>> Pattern for this Use Case. In fact there's a discussion for another day on
>>> how/where we could document this patterns and get peer reviews on them.
>>>
>>> Thanks again
>>>
>>> John
>>>
>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>
>>>>
>>>> On Jul 24, 2010, at 7:41 AM, [email protected] wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> 1/ it's a little hard to answer this question, your setup is certainly a
>>>>> little more complex than what you expose in your email :-) However
>>>>> thousands of documents are gracefuly handled by CouchDB.
>>>>>
>>>>> 2/ At first sight your documents will look like :
>>>>> { "_id": 0123456789 , "list": "mylist", "type": "NP",
>>>>> "status":"portedIn", "operatorId":1234 }
>>>>>
>>>>> That way you can query your document by phone number :
>>>>>
>>>>> GET /database/0123456789
>>>>>
>>>>> and have all documents belonging to the list "mylist" by creating a view
>>>>> that emits the "list" field :
>>>>>
>>>>> function (doc) {
>>>>> if ( doc.list && doc.type == "NP" ) {
>>>>> emit (doc.list,null);
>>>>> }
>>>>> }
>>>>>
>>>>> and fetching them with something like :
>>>>>
>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>
>>>>> 3/ When updating a document : the document is of course immediately
>>>>> available. However the view index won't be updated. In CouchDB view
>>>>> indexes are rebuilt on view query (not on document update). When you'll
>>>>> query CouchDB "give me all the documents of the view NP", Couch will take
>>>>> all documents that have changed (added, updated, deleted) since the last
>>>>> time you asked Couch for the view, and will update indexes accordingly.
>>>>> You have the option of fetching the view without rebuilding the index,
>>>>> with the "stale" parameter, but in this case, of course, you won't see
>>>>> the changes. During the rebuilt of the index, subsequent view queries are
>>>>> queued until the index is up to date.
>>>>>
>>>>> 4/ I setup CouchDB to parse network logs. A view took something like 25
>>>>> minuts for 100 millions documents, on a Dell PowerEdge 2950 Xen Virtual
>>>>> Machine with two dedicated processors and 4gigs ram. Numbers can heavily
>>>>> vary according to the complexity of the view, so it's always hard (and
>>>>> dangerous) to give numbers. Moreover my indexes were not only numbers,
>>>>> but also strings.
>>>>>
>>>>
>>>> this is a good response. I'd only follow up to say that there are some
>>>> techniques you can use to further tune view-generation performance. one:
>>>> keysize and entropy can make a big difference. the view by list, as above,
>>>> looks pretty good on that front.
>>>>
>>>> CouchDB can also be configured to store view indexes on a separate disk
>>>> from the database file, which can reduce IO contention if you are at the
>>>> edge of what your hardware can do.
>>>>
>>>> Also, there is the option to query views with stale=ok, which will return
>>>> a query based on the latest snapshot, with low latency, so clients aren't
>>>> blocked waiting for generation to complete. then you can use a cron-job
>>>> with a regular view query and limit=1 to keep the index up to date. so
>>>> clients always see a fairly recent snapshot, with low latency.
>>>>
>>>>>
>>>>> What you should be aware of is that CouchDB requires maintenance tasks to
>>>>> keep great performances, it's called "compact" and should be run on
>>>>> databases (to rebuilt the db file that is append-only) and on databases
>>>>> views (to rebuild the index file that is append-only). During the
>>>>> compact, database is still available but performances are degraded (from
>>>>> my personnal experience).
>>>>> Also, a new replication engine is in the pipe and should greatly improve
>>>>> the replication experience.
>>>>>
>>>>>
>>>>> Mickael
>>>>>
>>>>> ----- Mail Original -----
>>>>> De: "John" <[email protected]>
>>>>> À: [email protected]
>>>>> Envoyé: Samedi 24 Juillet 2010 11h37:56 GMT +01:00 Amsterdam / Berlin /
>>>>> Berne / Rome / Stockholm / Vienne
>>>>> Objet: Large lists of data
>>>>>
>>>>> Hi
>>>>>
>>>>> I'm currently evaluating couchdb as a candidate to replace the relational
>>>>> databases as used in our Telecom Applications.
>>>>> For most of our data I can see a good fit and we already expose our
>>>>> service provisioning as json over REST so we're well positioned for a
>>>>> migration.
>>>>> One area that concerns me though is whether this technology is suitable
>>>>> for our list data. An example of this is Mobile Number Portability where
>>>>> we have millions of rows of data representing ported numbers with some
>>>>> atrributes against each.
>>>>>
>>>>> We use the standard Relational approach to this and have an entries table
>>>>> that has a foreign key reference to a parent list.
>>>>>
>>>>> On our web services we do something like this:
>>>>>
>>>>> Create a List:
>>>>>
>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>> { "type": "NP"}
>>>>>
>>>>> To add a row to a list
>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>> { "status":"portedIn", "operatorId":1234}
>>>>>
>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>
>>>>> The list data is used when processing calls and it requires a fast lookup
>>>>> on the entries table which is obviously indexed.
>>>>>
>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>
>>>>> 1) Is couchdb the *right* technology for this job? (I know it can do it!)
>>>>>
>>>>> 2) I presume that the relationship I currently have in my relational
>>>>> database would remain the same for couch i.e. The entry document would
>>>>> ref the list document but maybe there's a better way to do this?
>>>>>
>>>>> 3) Number portability requires 15 min, 1 hour and daily syncs with a
>>>>> central number portability database. This can result in bulk updates of
>>>>> thousands of numbers. I'm concerned with how long it takes to build a
>>>>> couchdb index and to incrementally update it when the number of changes
>>>>> is large (Adds/removes).
>>>>> What does this mean to the availability of the number? i.e. Is the entry
>>>>> in the db but its unavailable to the application as it's entry in the
>>>>> index hasnt been built yet?
>>>>>
>>>>> 4) Telephone numbers like btrees so the index building should be quite
>>>>> fast and efficient I would of thought but does someone have anything more
>>>>> concrete in terms of how long it would take typically? I think that the
>>>>> bottleneck is the disk i/o and therefore it may be vastly different
>>>>> between my laptop and one of our beefy production servers but again I'd
>>>>> be interested in other peoples experience.
>>>>>
>>>>> Bit of a long one so thanks if you've read it to this point! There's a
>>>>> lot to like with couchdb (esp the replication for our use case) so I'm
>>>>> hoping that what i've asked above is feasible!
>>>>>
>>>>> Thanks
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>
>>>
>>
>