View generation only takes a second for me. My data set is half a GB or so, and it takes up around 800 MB including about a dozen views.
One thing to remember is never to emit whole documents or doc IDs. Every view row already carries the doc ID, and include_docs=true returns the document itself, so emitting a doc or doc ID is a waste of CPU cycles and space.

On May 25, 2012 9:44 AM, "Mike Kimber" <[email protected]> wrote:

> :-) E-mail is not a very good form of communication; apologies. As I said in my original post, we live in a world of "clever" workarounds, and the maps you are referring to are my "clever" workaround for the time it takes to build views against our data set, so that I can do ad-hoc analytics/information discovery on our data set.
>
> I place the doc._id in the emit KEY as it's guaranteed unique to each document. I create a map with header information in it, and then I create a detailed map out of an array that's in my document; this has the same doc._id as its KEY. I then use the Luciddb couchdb connector ( https://github.com/dynamobi/conn-couchdb ) to pull the two maps/views into luciddb tables (the other NVPs become columns), at which point I can join them on the doc._id KEY and then start changing the grouping of data in real time (seconds vs a 16 hour view rebuild). Once I have found what I'm looking for (data and logic) and am happy that it's going to remain static (i.e. I don't have to change the grouping/index members and/or order), I/we create a proper map reduce, of which we now have 13. An example can be found at:
>
> https://gist.github.com/2788255
>
> Oddly enough, whilst these have far more complex JS than the map I provided before (https://gist.github.com/2774485), they take a similar time to build (although the design doc takes 38 GB (post compaction) of space vs the 8 GB of raw documents!! i.e. they have less data in them, so why are they so much bigger?), which suggests to me that it's the size of the documents, i.e. that iterating through large documents is the issue.
> Even an incremental view update against 95 new documents takes 3.5 minutes (100% CPU burn), so I don't get instant data access unless I use ?stale=update_after
>
> So can I use a list function with all_docs and avoid the view build for data discovery?
>
> I know a couchdb view is equivalent to a DBMS index, which is again why I am questioning why it takes so long to build them and why they use so much space.
>
> I have couchdb lucene installed and it's excellent, but it only compounds my questions re view sizes and view generation, as I can index the whole of all my documents in far less time than it takes to run the map reduces and it only uses 2.4GB of disk space!!
>
> Clearly I seem to be a bit of a lone voice on this, as everyone skirts around why views take so long to build, why they only run on one CPU and why they take up so much space, but to my thinking view optimization would save a lot of CPU cycles and disk space, which would cut Cloudant's operational costs, allow some of that to be passed on to the customer and also benefit the wider community. The cloud is a utility model and cost management is key, plus it helps the environment, which is a finite resource!
>
> If I could just get an answer to the all_docs question, I'll leave you alone then!
>
> Thanks
>
> Mike
>
> PS: if it's any consolation, one of the chaps who works for me had a similar experience with Mongodb, although the use case was different :-). In that instance they moved to hadoop, which to me is a bit overkill for my paltry 50GB of data (8.5 GB compressed)!
>
> -----Original Message-----
> From: Robert Newson [mailto:[email protected]]
> Sent: 25 May 2012 11:29
> To: [email protected]
> Subject: Re: Am I doing something fundamentally wrong?
>
> Hi Mike,
>
> Several posters have been trying to tell you that you didn't need to build either of the views you posted.
> A view is to allow you to retrieve data efficiently by things other than the document id (or, with a reduce, to efficiently access aggregated data, sum, count and the like). In both of the views you posted you key by id. Instead of having either view you can use the _all_docs view with include_docs=true. This view is built in lock-step with your updates, so it's never stale.
>
> These views would be worth having if you needed super-low latency access to those document fragments. If that's the case, then the cost is the view build time, but it doesn't sound like you need it.
>
> You say you don't know the ids you want to query, but your views are keyed on doc._id (same as _all_docs). I don't understand that. From this distance, it seems you've built views you don't need to build and you have to read them in their entirety looking for the data you wanted. If that's true, or even half true, then it would explain your bad experience so far.
>
> Finally, I will close by saying that couchdb views are akin to SQL's 'CREATE INDEX'. Careful database design includes choosing which indexes to build (and which type) ahead of time. It's rare, and painful, to add indexes after the fact. For ad-hoc analysis, I wrote https://github.com/rnewson/couchdb-lucene.
>
> B.
>
> On 25 May 2012 10:52, Mike Kimber <[email protected]> wrote:
> > I have done this; one view per design doc. I then query each one and wait. Currently there are 2 design docs and so 2 CPUs burn at 100%. The other 2 CPUs/cores do nothing.
> >
> > On the all_docs option, that's the point: I don't know the ids. I want to analyse the attributes in my 80K set of documents to find the documents that are relevant.
> > Matthieu Rakotojaona summed it up in response to the other half of this post, which you can see at:
> >
> > http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3ccamizln1ele9nzt3hoyxemuoxgrfh2dwyafevcyo-it3mpss...@mail.gmail.com%3E
> >
> > Thanks
> >
> > Mike
> >
> > -----Original Message-----
> > From: Sean Copenhaver [mailto:[email protected]]
> > Sent: 24 May 2012 15:20
> > To: [email protected]
> > Subject: Re: Am I doing something fundamentally wrong?
> >
> > I believe multiple design documents will build views concurrently but one design document is basically done sequentially by the change sequence... not positive.
> >
> > So you could try splitting out your views into multiple design documents and hit them to see if that helps spread out the CPU usage. I want to say a lot of the CPU usage is the serialization process that is happening communicating from CouchDB's core to the view engine process.
> >
> > Anyway, with the list you specify any view, and all_docs is a view with all documents in a database. So if you know the ids you want to work with, you can do a normal view query with a list function.
> > http://wiki.apache.org/couchdb/HTTP_Document_API#all_docs
> >
> > That's what Robert was trying to get at.
> >
> > On Thu, May 24, 2012 at 9:55 AM, Mike Kimber <[email protected]> wrote:
> >
> >> Robert,
> >>
> >> Couchdb Lists work on top of views (and look great by the way), however that brings me back to my initial post (causes an error on this mailing list for some reason but you can find a copy here http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%[email protected]%3E ) :-). Namely, generating a view (well, a design document with views in it) on our data set takes between 6 (simple view) and 16 hours, takes up a lot of disk space for what seems a small amount of data and burns a CPU at 100% for the full time it runs, i.e.
> >> no IO contention and can't use multiple cores/cpus. So again, am I doing something fundamentally wrong, or is this just the way Couch works and most people don't have a data set like ours so it does not take that long to create views, or does Big Couch solve the issue (although it would seem 10 big couch nodes would still take an hour)?
> >>
> >> Looks like you work at Cloudant, so hopefully you might be able to provide some answers based on real world experience?
> >>
> >> Mike
> >>
> >> -----Original Message-----
> >> From: Robert Newson [mailto:[email protected]]
> >> Sent: 24 May 2012 12:08
> >> To: [email protected]
> >> Subject: Re: Am I doing something fundamentally wrong?
> >>
> >> Or use a list function;
> >>
> >> http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
> >>
> >> You can use one with _all_docs and you can POST an array of ids too.
> >>
> >> http://wiki.apache.org/couchdb/HTTP_view_API
> >>
> >> > Since 0.9 you can also issue POST requests to views where you can send the following JSON structure in the body:
> >> > {"keys": ["key1", "key2", ...]}
> >>
> >> B.
> >>
> >> On 24 May 2012 11:58, Mike Kimber <[email protected]> wrote:
> >> > Looking at the Show documentation and running a quick test, I don't think this helps, as Show has to be referenced by a doc._id or view key. If these aren't provided it returns null. This makes sense as it's for generation of an HTML or XML page/doc etc.
> >> >
> >> > So I'd have to get a list of all doc IDs I want and then call the show function for each, and to get a filtered list I need a view.
> >> >
> >> > Mike
> >> >
> >> > -----Original Message-----
> >> > From: Mike Kimber [mailto:[email protected]]
> >> > Sent: 24 May 2012 10:47
> >> > To: [email protected]
> >> > Subject: RE: Am I doing something fundamentally wrong?
> >> >
> >> > Aurélien,
> >> >
> >> > Thanks for the response, and apologies: I didn't get a notification (e-mail) of my original post (or the 2nd one) or your response. When I look at my original post in Google Reader it has "An error occurred while fetching this message, sorry !", so there must be something in the e-mail that the mailing list system does not like.
> >> >
> >> > In response to your original response "I'm a bit puzzled by the fact that your map functions use the document ID": I do this because I load the data into Luciddb and this allows me to join between tables. This is not my end game; this is just a compromise due to the time it takes to generate a view and my need to play/discover with the data.
> >> >
> >> > I will look at show to see if it helps; however, it does not really answer my original questions and it does not remove the more general issue that view build takes a very long time, only uses a single CPU and uses a bucket load of space even with compression on (no idea why, when it has a lot less data than the original).
> >> >
> >> > Thanks
> >> >
> >> > Mike
> >> >
> >> > -----Original Message-----
> >> > From: Aurélien Bénel [mailto:[email protected]]
> >> > Sent: 24 May 2012 07:40
> >> > To: [email protected]
> >> > Subject: Re: Am I doing something fundamentally wrong?
> >> >
> >> > Hi Mike,
> >> >
> >> >> Didn't seem to get there first time so having another go
> >> >
> >> > As I wrote in my earlier post, the use of 'map' functions in both of your examples is overkill.
> >> > Use 'show' functions instead. They won't require an index to be built.
> >> >
> >> > Regards,
> >> >
> >> > Aurélien
> >
> > --
> > "The limits of language are the limits of one's world. " - Ludwig von Wittgenstein
> >
> > "Water is fluid, soft and yielding. But water will wear away rock, which is rigid and cannot yield. As a rule, whatever is fluid, soft and yielding will overcome whatever is rigid and hard. This is another paradox: what is soft is strong." - Lao-Tzu
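
PS: for anyone who wants to try the _all_docs suggestion quoted above without building a view first, here is a rough Python sketch of the two requests mentioned in the thread: a GET on _all_docs with include_docs=true, and a POST with a {"keys": [...]} body. The server address, the database name "mydb" and the helper names are my own placeholders, not anything from Mike's setup, and the sketch only builds the requests rather than sending them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumptions for illustration only: a local CouchDB and a database called "mydb".
BASE_URL = "http://localhost:5984"
DB_NAME = "mydb"


def all_docs_url(include_docs=True):
    """Build the URL of the built-in _all_docs view (hypothetical helper)."""
    url = f"{BASE_URL}/{quote(DB_NAME, safe='')}/_all_docs"
    if include_docs:
        # include_docs=true asks CouchDB to return each full document with its row.
        url += "?" + urlencode({"include_docs": "true"})
    return url


def all_docs_by_ids_request(doc_ids, include_docs=True):
    """Build a POST request that fetches only the given ids (hypothetical helper)."""
    # The body is the JSON structure quoted from the wiki: {"keys": [...]}.
    body = json.dumps({"keys": list(doc_ids)}).encode("utf-8")
    return Request(
        all_docs_url(include_docs),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(all_docs_url())
    req = all_docs_by_ids_request(["doc-1", "doc-2"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually run either query you would pass the URL or the Request object to urllib.request.urlopen and parse the JSON response; I left that out so the sketch works without a running server.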
