Re: getting most recent doc

Adam Kocoloski Mon, 19 Apr 2010 07:23:06 -0700

On Apr 19, 2010, at 10:10 AM, Eric Casteleijn wrote:

> On 04/19/2010 09:41 AM, Adam Kocoloski wrote:
>> On Apr 17, 2010, at 11:09 AM, Eric Casteleijn wrote:
>> 
>>> On 04/16/2010 04:46 AM, wolfgang haefelinger wrote:
>>>> Thanks Robert
>>>> 
>>>> for your answer. However, it is not exactly what I was looking for
>>>> (due to my inappropriate problem description).
>>>> 
>>>> Firstly, I do want to have the document instead of the time stamp in
>>>> order to avoid that additional document fetch. That's obviously easy
>>>> to fix:
>>>> 
>>>> function(doc) { //
>>>>  emit([doc.name, doc.timestamp], doc);
>>>> }
>>> 
>>> Don't do that. It's unnecessary, because you can always call any view with 
>>> '?include_docs=true' and it will add a 'doc' member to each row, containing 
>>> the document. Worse, it's harmful, as it makes the indexes stored on disk 
>>> many times larger than they need to be. (Depending on the size of your 
>>> documents this can make a huge difference. Anecdotally, gwibber used to do 
>>> this, and when I changed it, the indexes stored on disk shrank by some 90%.)
>>> 
>>> If you always want the whole document, just emit null for a value and 
>>> always call the view with include_docs.
>>> 
>>> If there are cases where you don't want the whole document, decide which 
>>> data you need and only emit that.
>> 
>> Hi Eric, I don't think it's correct to have a blanket recommendation to 
>> always use include_docs=true.  For large range queries on a view the query 
>> performance will be much better - up to 10x better throughput on large DBs 
>> in my experience - if the doc is already included.  Yes, the view index will 
>> balloon in size, but some people may be willing to make that tradeoff.  
>> Cheers,
> 
> Oops, thanks for catching that Adam, and my apologies, that was rather 
> myopic. I didn't think about the other side of the tradeoff, but that makes a 
> lot of sense.
> 
> I still wonder in that case if there is something you can do to shrink the 
> stored views somewhat: gwibber had a number of views that emitted the whole 
> document, but those documents (typically representing a twitter or identi.ca 
> message) weren't very large in themselves. My database, after compaction, was 
> somewhere between 70 and 80 MB, whereas the indexes took over a GB. Since 
> gwibber+desktopcouch run on the desktop, where only one client typically 
> talks to couch, I still think we made the right decision to sacrifice speed 
> for disk space. On a server, both are important though, considering we host 
> multiple couchdbs per user. Luckily we don't compute the views for the 
> gwibber dbs server side, but I'm sure it's something we'll run into again 
> elsewhere.
>


Were the view indices also compacted?  If so, that's very surprising to me.  I 
should double-check our numbers, but I seem to remember the compacted view 
indices for our case (which had similarly-sized documents) being comparable in 
size to the DBs.
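
For reference, view indices are compacted separately from the database, per 
design document. Below is a minimal sketch of that request, assuming a local 
CouchDB on the default port. The database name "gwibber_messages" and the 
design document name "messages" are made up for illustration, and the helper 
only builds the request description; it does not send anything.

```javascript
// Sketch only: describes the HTTP request that triggers view compaction
// for one design document (POST /{db}/_compact/{design-doc-name}).
// The helper name and the example names below are hypothetical.
function viewCompactionRequest(baseUrl, dbName, designDocName) {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    method: "POST",
    url: root + "/" + encodeURIComponent(dbName) +
         "/_compact/" + encodeURIComponent(designDocName),
    headers: { "Content-Type": "application/json" },
  };
}

const req = viewCompactionRequest(
  "http://127.0.0.1:5984/", "gwibber_messages", "messages");
console.log(req.method, req.url);
```

The design document name is given without its "_design/" prefix. Comparing 
the index size before and after such a request is the quickest way to tell 
whether the numbers being discussed were for compacted indices.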

There are a few things we can do to decrease the size of uncompacted view 
indices.  Chief among those is to put a lower bound on the size of a view index 
write, as reported by Henrik Jensen last month (COUCHDB-700).  Cheers,

Adam
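
P.S. The tradeoff discussed in this thread can be sketched with a toy 
harness. This is not CouchDB code: buildRows is a hypothetical stand-in for 
the view server, emit is passed in as an argument rather than being a global, 
the sample documents are invented, and the JSON length is only a rough proxy 
for on-disk index size.

```javascript
// Toy stand-in for the view server: run a map function over each doc
// and collect the rows it emits.
function buildRows(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows;
}

// Variant 1: the whole document is stored in the index (faster range
// queries, larger index).
const mapWithDoc = (doc, emit) => emit([doc.name, doc.timestamp], doc);

// Variant 2: only the key is stored; the document is fetched at query
// time with ?include_docs=true (smaller index, extra lookup per row).
const mapWithNull = (doc, emit) => emit([doc.name, doc.timestamp], null);

// Invented sample documents.
const docs = [
  { _id: "a", name: "alice", timestamp: 1, text: "x".repeat(140) },
  { _id: "b", name: "alice", timestamp: 2, text: "y".repeat(140) },
];

const fatRows = buildRows(mapWithDoc, docs);
const leanRows = buildRows(mapWithNull, docs);
const fatBytes = JSON.stringify(fatRows).length;
const leanBytes = JSON.stringify(leanRows).length;
console.log("emit doc: ", fatBytes, "bytes");
console.log("emit null:", leanBytes, "bytes");
```

Both variants produce the same keys in the same order; only the stored value 
differs, which is where the size difference comes from.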

> -- 
> eric casteleijn
> https://code.launchpad.net/~thisfred
> Canonical Ltd.
>
