RE: Am I doing something fundamentally wrong?

Mike Kimber Fri, 25 May 2012 07:56:22 -0700

It's OK on the _all_docs question; Matthieu RAKOTOJAONA provided the answer in
another post, i.e. yes.

Thanks

Mike 

-----Original Message-----
From: Mike Kimber [mailto:[email protected]] 
Sent: 25 May 2012 15:44
To: [email protected]
Subject: RE: Am I doing something fundamentally wrong?

:-) E-mail is not a very good form of communication; apologies. As I said in my
original post, we live in a world of "clever" workarounds, and the maps you
are referring to are my "clever" workaround for the time it takes to build
views against our data set, so that I can do ad-hoc analytics/information
discovery on our data set.

I place the doc._id in the emit KEY as it's guaranteed unique to each document.
I create a map with header information in it, and then I create a detailed map
out of an array that's in my document; this has the same doc._id as its
KEY. I then use the LucidDB CouchDB connector (
https://github.com/dynamobi/conn-couchdb ) to pull the two map/views into
LucidDB tables (the other name/value pairs become columns), at which point I can
join them on the doc._id KEY and then start changing the grouping of data in
real time (seconds vs a 16 hour view rebuild). Once I have found what I'm
looking for (data and logic) and am happy that it's going to remain static (i.e.
I don't have to change the grouping/index members and/or order), I/we create a
proper map reduce. We have 13 of these now, and an example can be found at:

https://gist.github.com/2788255
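
A rough sketch of the join described above, in Python rather than LucidDB SQL.
The row shapes and the field names (customer, sku, qty) are hypothetical and
are not taken from the actual data set:

```python
from collections import defaultdict

# Hypothetical rows, shaped like CouchDB view output: one header row per
# document and one detail row per array element, both keyed by doc._id.
header_rows = [
    {"key": "doc-1", "value": {"customer": "acme"}},
    {"key": "doc-2", "value": {"customer": "globex"}},
]
detail_rows = [
    {"key": "doc-1", "value": {"sku": "a", "qty": 2}},
    {"key": "doc-1", "value": {"sku": "b", "qty": 1}},
    {"key": "doc-2", "value": {"sku": "a", "qty": 5}},
]


def join_on_key(headers, details):
    """Inner-join header and detail rows on their shared view key."""
    details_by_key = defaultdict(list)
    for row in details:
        details_by_key[row["key"]].append(row["value"])
    joined = []
    for row in headers:
        for detail in details_by_key[row["key"]]:
            # Flatten header and detail fields into one table-like record.
            joined.append({"doc_id": row["key"], **row["value"], **detail})
    return joined


def total_qty_by(records, column):
    """Regroup the joined records by any column, summing the qty field."""
    totals = defaultdict(int)
    for record in records:
        totals[record[column]] += record["qty"]
    return dict(totals)


joined = join_on_key(header_rows, detail_rows)
by_customer = total_qty_by(joined, "customer")
by_sku = total_qty_by(joined, "sku")
```

Changing the grouping is then just a different column argument, with no view
rebuild involved.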

Oddly enough, whilst these map reduces have far more complex JS than the map I
provided before (https://gist.github.com/2774485), they take a similar time to
build. The design doc takes 38 GB (post compaction) of space vs the 8 GB of raw
documents!! The views have less data in them, so why are they so much bigger?
The similar build time suggests to me that the size of the documents is the
issue, i.e. that iterating through large documents is slow. Even an incremental
view update against 95 new documents takes 3.5 minutes (100% CPU burn), so I
don't get instant data access unless I use ?stale=update_after.
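
As an illustration, a view query with stale=update_after could be built like
this. The server address, database, design document and view names are
placeholders, not the real ones:

```python
from urllib.parse import quote, urlencode


def view_url(base_url, db, ddoc, view, **params):
    """Build the URL for querying a CouchDB view with optional parameters."""
    path = "/".join(
        [quote(db, safe=""), "_design", quote(ddoc, safe=""), "_view", quote(view, safe="")]
    )
    query = urlencode(params)
    return f"{base_url}/{path}" + (f"?{query}" if query else "")


# Placeholder names; stale=update_after returns the existing index
# immediately and triggers the index update after the response.
url = view_url("http://localhost:5984", "mydb", "analytics", "by_header",
               stale="update_after")
```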

So can I use a list function with _all_docs and avoid the view build for data
discovery?

I know a CouchDB view is equivalent to a DBMS index, which again is why I am
questioning why they take so long to build and why they use so much space.

I have couchdb-lucene installed and it's excellent, but it only compounds my
questions re view sizes and view generation, as I can index the whole of all my
documents in far less time than it takes to run the map reduces and it only
uses 2.4 GB of disk space!!

Clearly I seem to be a bit of a lone voice on this, as everyone skirts around
the questions of why views take so long to build, why they only run on one CPU
and why they take up so much space. To my thinking, view optimization would save
a lot of CPU cycles and disk space, which would cut Cloudant's operational
costs, allow some of that to be passed on to the customer and also benefit the
wider community. The cloud is a utility model and cost management is key, plus
it helps the environment, which is a finite resource!

If I could just get an answer to the _all_docs question I'll leave you alone then!

Thanks 

Mike 

PS: if it's any consolation, one of the chaps who works for me had a similar
experience with MongoDB, although the use case was different :-). In that
instance they moved to Hadoop, which to me is a bit overkill for my paltry 50 GB
of data (8.5 GB compressed)!

-----Original Message-----
From: Robert Newson [mailto:[email protected]] 
Sent: 25 May 2012 11:29
To: [email protected]
Subject: Re: Am I doing something fundamentally wrong?

Hi Mike,

Several posters have been trying to tell you that you didn't need to
build either of the views you posted. A view is to allow you to
retrieve data efficiently by things other than the document id (or,
with a reduce, to efficiently access aggregated data, sum, count and
the like). In both of the views you posted you key by id. Instead of
having either view you can use the _all_docs view with
include_docs=true. This view is built in lock-step with your updates,
so it's never stale.
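
A minimal sketch of such a request in Python, assuming a hypothetical server at
http://localhost:5984 and a database called mydb. It only builds the requests;
sending them is left to whatever HTTP client you use. The POST form with a
"keys" body is the one described on the HTTP view API wiki page:

```python
import json
from urllib.parse import quote, urlencode


def all_docs_request(base_url, db, keys=None):
    """Return (method, url, body) for an _all_docs query with full documents.

    Without keys this is a plain GET over every document; with keys it is a
    POST whose JSON body restricts the result to those document ids.
    """
    query = urlencode({"include_docs": "true"})
    url = f"{base_url}/{quote(db, safe='')}/_all_docs?{query}"
    if keys is None:
        return "GET", url, None
    return "POST", url, json.dumps({"keys": list(keys)})


# Hypothetical server, database and document ids.
get_request = all_docs_request("http://localhost:5984", "mydb")
post_request = all_docs_request("http://localhost:5984", "mydb", keys=["id1", "id2"])
```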

These views would be worth having if you needed super-low latency access to
those document fragments. If that's the case, then the cost is the view
build time, but it doesn't sound like you need it.

You say you don't know the ids you want to query, but your views are
keyed on doc._id (same as _all_docs). I don't understand that. From
this distance, it seems you've built views you don't need to build
and you have to read them in their entirety looking for the data you
wanted. If that's true, or even half true, then it would explain your
bad experience so far.

Finally, I will close by saying that couchdb views are akin to SQL's
'CREATE INDEX'. Careful database design includes choosing which
indexes to build (and which type) ahead of time. It's rare, and
painful, to add indexes after the fact. For ad-hoc analysis, I wrote
https://github.com/rnewson/couchdb-lucene.

B.

On 25 May 2012 10:52, Mike Kimber <[email protected]> wrote:
> I have done this; one view per design doc. I then query each one and wait. 
> Currently there are 2 design docs and so 2 CPU's burn at 100%. The other 2 
> CPU's/cores do nothing.
>
> On the _all_docs option, that's the point: I don't know the ids. I want to
> analyse the attributes in my 80K set of documents to find the documents that
> are relevant. Matthieu Rakotojaona summed it up in response to the other half
> of this post, which you can see at:
>
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3ccamizln1ele9nzt3hoyxemuoxgrfh2dwyafevcyo-it3mpss...@mail.gmail.com%3E
>
> Thanks
>
> Mike
>
> -----Original Message-----
> From: Sean Copenhaver [mailto:[email protected]]
> Sent: 24 May 2012 15:20
> To: [email protected]
> Subject: Re: Am I doing something fundamentally wrong?
>
> I believe multiple design documents will build views concurrently but one
> design document is basically done sequentially by the change sequence...
> not positive.
>
> So you could try splitting out your views into multiple design documents
> and hit them to see if that helps spread out the CPU usage. I want to say a
> lot of the CPU usage is the serialization process that is happening
> communicating from CouchDB's core to the view engine process.
>
> Anyway, with the list you specify any view, and all_docs is a view with all
> documents in a database. So if you know the ids you want to work with you
> can do a normal view query with a list function.
> http://wiki.apache.org/couchdb/HTTP_Document_API#all_docs
>
> That's what Robert was trying to get at.
>
> On Thu, May 24, 2012 at 9:55 AM, Mike Kimber <[email protected]> wrote:
>
>> Robert,
>>
>> Couchdb Lists work on top of views (and look great by the way), however
>> that brings me back to my initial post (causes an error on this mailing
>> list for some reason but you can find a copy here
>> http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%[email protected]%3E)
>> :-). Namely generating a view (well a design document with views in it) on
>> our data set takes between 6 (simple view) and 16 hours, takes up a lot of
>> disk space for what seems a small amount of data and burns a CPU at 100%
>> for the full time it runs i.e. no IO contention and can't use multiple
>> cores/cpus. So again, am I doing something fundamentally wrong, or is this
>> just the way Couch works and most people don't have a data set like ours, so
>> it does not take that long to create views? Or does Big Couch solve the
>> issue (although it would seem 10 Big Couch nodes would still take an hour)?
>>
>> Looks like you work at Cloudant, so hopefully you might be able to provide
>> some answers based on real world experience?
>>
>> Mike
>>
>>
>>
>> -----Original Message-----
>> From: Robert Newson [mailto:[email protected]]
>> Sent: 24 May 2012 12:08
>> To: [email protected]
>> Subject: Re: Am I doing something fundamentally wrong?
>>
>> Or use a list function;
>>
>> http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
>>
>> You can use one with _all_docs and you can POST an array of ids too.
>>
>> http://wiki.apache.org/couchdb/HTTP_view_API
>>
>> > Since 0.9 you can also issue POST requests to views where you can send
>> > the following JSON structure in the body:
>> > {"keys": ["key1", "key2", ...]}
>>
>> B.
>>
>> On 24 May 2012 11:58, Mike Kimber <[email protected]> wrote:
>> > Looking at the Show documentation and running a quick test, I don't think
>> > this helps, as Show has to be referenced by a doc._id or view key. If these
>> > aren't provided it returns null. This makes sense, as it's for generation of
>> > an HTML or XML page/doc etc.
>> >
>> > So I'd have to get a list of all doc IDs I want and then call the show
>> > function for each, and to get a filtered list I need a view.
>> >
>> > Mike
>> >
>> > -----Original Message-----
>> > From: Mike Kimber [mailto:[email protected]]
>> > Sent: 24 May 2012 10:47
>> > To: [email protected]
>> > Subject: RE: Am I doing something fundamentally wrong?
>> >
>> > Aurélien,
>> >
>> > Thanks for the response, and apologies: I didn't get a notification
>> > (e-mail) of my original post (or the 2nd one) or your response. When I look
>> > at my original post in Google Reader it has "An error occurred while
>> > fetching this message, sorry !", so there must be something in the e-mail
>> > that the mailing list system does not like.
>> >
>> > In response to your original response, "I'm a bit puzzled by the fact
>> > that your map functions use the document ID": I do this because I load the
>> > data into LucidDB and this allows me to join between tables. This is not my
>> > end game; this is just a compromise due to the time it takes to generate a
>> > view and my need to play with/discover the data.
>> >
>> > I will look at show to see if it helps. However, it does not really
>> > answer my original questions and it does not remove the more general issue
>> > that view build takes a very long time, only uses a single CPU and uses
>> > a bucket load of space even with compression on (no idea why, when it has a
>> > lot less data than the original).
>> >
>> > Thanks
>> >
>> > Mike
>> >
>> > -----Original Message-----
>> > From: Aurélien Bénel [mailto:[email protected]]
>> > Sent: 24 May 2012 07:40
>> > To: [email protected]
>> > Subject: Re: Am I doing something fundamentally wrong?
>> >
>> > Hi Mike,
>> >
>> >> Didn't seem to get there first time so having another go
>> >
>> > As I wrote in my earlier post, the use of 'map' functions in both of
>> > your examples is overkill.
>> > Use 'show' functions instead. They won't require an index to be built.
>> >
>> >
>> > Regards,
>> >
>> > Aurélien
>>
>
>
>
> --
> "The limits of language are the limits of one's world. " - Ludwig von
> Wittgenstein
>
> "Water is fluid, soft and yielding. But water will wear away rock, which is
> rigid and cannot yield. As a rule, whatever is fluid, soft and yielding
> will overcome whatever is rigid and hard. This is another paradox: what is
> soft is strong." - Lao-Tzu
