RE: Am I doing something fundamentally wrong?

Keith Gable Fri, 25 May 2012 08:00:35 -0700

View generation only takes a second for me... my data set is half a GB or
so and it takes up around 800MB including about a dozen views.


One thing to remember is never to emit whole documents or doc IDs. The doc
ID is already part of every view row, and the whole document can be fetched
with include_docs=true, so emitting either is a waste of CPU cycles and
space.
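
As a rough sketch of the difference (the sample documents, the `type` field
and the `buildView` helper are all made up for illustration; `buildView` only
imitates what CouchDB's view server does so the snippet runs under plain
Node.js):

```javascript
// Stand-in for CouchDB's view server so the sketch runs under plain Node.js.
// In CouchDB itself `emit` is provided, and each row's `id` is filled in for you.
function buildView(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    globalThis.emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc);
  }
  return rows;
}

// Wasteful: copies the doc id and the whole document into the index again.
function mapHeavy(doc) {
  emit(doc.type, { _id: doc._id, doc: doc });
}

// Lean: key on the field you query by and emit no value at all.
function mapLean(doc) {
  emit(doc.type, null);
}

// Hypothetical sample documents.
const docs = [
  { _id: "a1", type: "order", items: ["x", "y", "z"] },
  { _id: "b2", type: "invoice", items: ["p", "q"] },
];

const heavyRows = buildView(mapHeavy, docs);
const leanRows = buildView(mapLean, docs);

// Rough proxy for index size: length of the rows serialised as JSON.
const heavyBytes = JSON.stringify(heavyRows).length;
const leanBytes = JSON.stringify(leanRows).length;
console.log({ heavyBytes, leanBytes });
```

The lean rows still carry the doc id, and querying the real view with
?include_docs=true returns the full document next to each row.
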
On May 25, 2012 9:44 AM, "Mike Kimber" <[email protected]> wrote:

> :-) E-mail is not a very good form of communication; apologies. As I said
> in my original post, we live in a world of "clever" workarounds, and the
> maps you are referring to are my "clever" workaround for the time it
> takes to build views against our data set, so that I can do ad-hoc
> analytics/information discovery on our data set.
>
> I place the doc._id in the emit KEY as it's guaranteed unique to each
> document. I create a map with header information in it, and then I create a
> detailed map out of an array that's in my document; this has the same
> doc._id as its KEY. I then use the Luciddb couchdb connector (
> https://github.com/dynamobi/conn-couchdb ) to pull the two maps/views into
> luciddb tables (the other NVPs become columns), at which point I can join
> them on the doc._id KEY and then start changing the grouping of data in
> real time (seconds vs a 16 hour view rebuild). Now when I have found what
> I'm looking for (data and logic) and am happy that it's going to remain
> static (i.e. I don't have to change the grouping/index members and/or
> order), I/we create a proper map reduce, of which we now have 13; an
> example can be found at:
>
> https://gist.github.com/2788255
>
> Oddly enough, whilst these have far more complex JS than the map I provided
> before (https://gist.github.com/2774485), they take a similar time to
> build (although the design doc takes 38 GB of space post compaction vs
> the 8GB of raw documents!! They have less data in them, so why are they
> so much bigger?). This suggests to me that it's the size of the documents,
> i.e. iterating through large documents, that is the issue. Even an
> incremental view update against 95 new documents takes 3.5 minutes (100%
> CPU burn), so I don't get instant data access unless I use
> ?stale=update_after
>
> So can I use a list function with all_docs and avoid the view build for
> data discovery?
>
> I know a couchdb view is equivalent to a DBMS index, which again is why I
> am questioning why it takes so long to build them and why they use so much
> space.
>
> I have couchdb lucene installed and it's excellent, but it only compounds my
> questions re view sizes and view generation, as I can index the whole of
> all my documents in far less time than it takes to run the map reduces and
> it only uses 2.4GB of disk space!!
>
> Clearly I seem to be a bit of a lone voice on this, as everyone skirts
> around why views take so long to build, why they only run on one
> CPU and why they take up so much space, but to my thinking view
> optimization would save a lot of CPU cycles and disk space, which would cut
> Cloudant's operational costs, allow some of that to be passed on to the
> customer and also benefit the wider community. The cloud is a utility
> model and cost management is key, plus it helps the environment, which is a
> finite resource!
>
> If I could just get an answer to the all_docs question, I'll leave you
> alone then!
>
> Thanks
>
> Mike
>
> PS: if it's any consolation one of the chaps who works for me had a
> similar experience with Mongodb, although the use case was different :-).
> In that instance they moved to hadoop, which to me is a bit overkill for my
> paltry 50GB of data (8.5 GB compressed)!
>
> -----Original Message-----
> From: Robert Newson [mailto:[email protected]]
> Sent: 25 May 2012 11:29
> To: [email protected]
> Subject: Re: Am I doing something fundamentally wrong?
>
> Hi Mike,
>
> Several posters have been trying to tell you that you didn't need to
> build either of the views you posted. A view is to allow you to
> retrieve data efficiently by things other than the document id (or,
> with a reduce, to efficiently access aggregated data, sum, count and
> the like). In both of the views you posted you key by id. Instead of
> having either view you can use the _all_docs view with
> include_docs=true. This view is built in lock-step with your updates,
> so it's never stale.
>
> These views would be worth having if you needed super-low latency access
> to those document fragments. If that's the case, then the cost is the
> view build time, but it doesn't sound like you need it.
>
> You say you don't know the ids you want to query, but your views are
> keyed on doc._id (same as _all_docs). I don't understand that. From
> this distance, it seems you've built views you don't need to build
> and you have to read them in their entirety looking for the data you
> wanted. If that's true, or even half true, then it would explain your
> bad experience so far.
>
> Finally, I will close by saying that couchdb views are akin to SQL's
> 'CREATE INDEX'. Careful database design includes choosing which
> indexes to build (and which type) ahead of time. It's rare, and
> painful, to add indexes after the fact. For ad-hoc analysis, I wrote
> https://github.com/rnewson/couchdb-lucene.
>
> B.
>
>
> On 25 May 2012 10:52, Mike Kimber <[email protected]> wrote:
> > I have done this; one view per design doc. I then query each one and
> > wait. Currently there are 2 design docs and so 2 CPUs burn at 100%. The
> > other 2 CPUs/cores do nothing.
> >
> > On the all_docs option, that's the point: I don't know the ids. I want to
> > analyse the attributes in my 80K set of documents to find the documents
> > that are relevant. Matthieu Rakotojaona summed it up in response to the
> > other half of this post, which you can see at:
> >
> >
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3ccamizln1ele9nzt3hoyxemuoxgrfh2dwyafevcyo-it3mpss...@mail.gmail.com%3E
> >
> > Thanks
> >
> > Mike
> >
> > -----Original Message-----
> > From: Sean Copenhaver [mailto:[email protected]]
> > Sent: 24 May 2012 15:20
> > To: [email protected]
> > Subject: Re: Am I doing something fundamentally wrong?
> >
> > I believe multiple design documents will build views concurrently but one
> > design document is basically done sequentially by the change sequence...
> > not positive.
> >
> > So you could try splitting out your views into multiple design documents
> > and hit them to see if that helps spread out the CPU usage. I want to
> > say a lot of the CPU usage is the serialization process that is happening
> > when communicating from CouchDB's core to the view engine process.
> >
> > Anyway, with a list function you can specify any view, and all_docs is a
> > view with all documents in a database. So if you know the ids you want to
> > work with, you can do a normal view query with a list function.
> > http://wiki.apache.org/couchdb/HTTP_Document_API#all_docs
> >
> > That's what Robert was trying to get at.
> >
> > On Thu, May 24, 2012 at 9:55 AM, Mike Kimber <[email protected]> wrote:
> >
> >> Robert,
> >>
> >> Couchdb Lists work on top of views (and look great by the way), however
> >> that brings me back to my initial post (causes an error on this mailing
> >> list for some reason but you can find a copy here
> >>
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%[email protected]%3E
> )
> >> :-). Namely, generating a view (well, a design document with views in it)
> >> on our data set takes between 6 (simple view) and 16 hours, takes up a
> >> lot of disk space for what seems a small amount of data and burns a CPU
> >> at 100% for the full time it runs, i.e. no IO contention and it can't use
> >> multiple cores/CPUs. So again, am I doing something fundamentally wrong,
> >> or is this just the way Couch works and most people don't have a data set
> >> like ours, so it does not take that long to create views? Or does Big
> >> Couch solve the issue (although it would seem 10 Big Couch nodes would
> >> still take an hour)?
> >> Looks like you work at Cloudant, so hopefully you might be able to
> >> provide some answers based on real world experience?
> >>
> >> Mike
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Robert Newson [mailto:[email protected]]
> >> Sent: 24 May 2012 12:08
> >> To: [email protected]
> >> Subject: Re: Am I doing something fundamentally wrong?
> >>
> >> Or use a list function;
> >>
> >> http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
> >>
> >> You can use one with _all_docs and you can POST an array of ids too.
> >>
> >> http://wiki.apache.org/couchdb/HTTP_view_API
> >>
> >> > Since 0.9 you can also issue POST requests to views where you can send
> >> > the following JSON structure in the body:
> >> > {"keys": ["key1", "key2", ...]}
> >>
> >> B.
> >>
> >> On 24 May 2012 11:58, Mike Kimber <[email protected]> wrote:
> >> > Looking at Show documentation and running a quick test I don't think
> >> > this helps, as Show has to be referenced by a doc._id or view key. If
> >> > these aren't provided it returns null. This makes sense as it's for
> >> > generation of an HTML, XML page/doc etc.
> >> >
> >> > So I'd have to get a list of all doc IDs I want and then call the
> >> > show function for each, and to get a filtered list I need a view.
> >> >
> >> > Mike
> >> >
> >> > -----Original Message-----
> >> > From: Mike Kimber [mailto:[email protected]]
> >> > Sent: 24 May 2012 10:47
> >> > To: [email protected]
> >> > Subject: RE: Am I doing something fundamentally wrong?
> >> >
> >> > Aurélien,
> >> >
> >> > Thanks for the response and apologies I didn't get a notification
> >> > (e-mail) of my original post (or the 2nd one) or your response. When I
> >> > look at my original post in Google Reader it has "An error occurred
> >> > while fetching this message, sorry !", so there must be something in
> >> > the e-mail that the mailing list system does not like.
> >> >
> >> > In response to your original response "I'm a bit puzzled by the fact
> >> > that your map functions use the document ID": I do this because I load
> >> > the data into Luciddb and this allows me to join between tables. This
> >> > is not my end game; this is just a compromise due to the time it takes
> >> > to generate a view and my need to play/discover with the data.
> >> >
> >> > I will look at show to see if it helps, however it does not really
> >> > answer my original questions and it does not remove the more general
> >> > issue that view build takes a very long time, only uses a single CPU
> >> > and uses a bucket load of space even with compression on (no idea why,
> >> > when it has a lot less data than the original).
> >> >
> >> > Thanks
> >> >
> >> > Mike
> >> >
> >> > -----Original Message-----
> >> > From: Aurélien Bénel [mailto:[email protected]]
> >> > Sent: 24 May 2012 07:40
> >> > To: [email protected]
> >> > Subject: Re: Am I doing something fundamentally wrong?
> >> >
> >> > Hi Mike,
> >> >
> >> >> Didn't seem to get there first time so having another go
> >> >
> >> > As I wrote in my earlier post, the use of 'map' functions in both of
> >> > your examples is overkill.
> >> > Use 'show' functions instead. They won't require an index to be built.
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Aurélien
> >>
> >
> >
> >
> > --
> > "The limits of language are the limits of one's world. " - Ludwig von
> > Wittgenstein
> >
> > "Water is fluid, soft and yielding. But water will wear away rock, which
> > is rigid and cannot yield. As a rule, whatever is fluid, soft and yielding
> > will overcome whatever is rigid and hard. This is another paradox: what
> > is soft is strong." - Lao-Tzu
>
