We've solved the problem using Jim's approach, but at a small cost: we had to round dates down to the beginning of each month (not day, as he suggested). When we ran the reduce with grouping, the view output shrank to a much smaller number of rows, which we then fed to a list function, which collated the countries and returned them.
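A minimal sketch of the approach described above, assuming CouchDB-style map/reduce. The `emit()` harness and the `collate()` helper below are illustrative, not from the thread; in CouchDB itself, `emit()` is provided by the view server and the collation logic would live in a `_list` function running over the grouped view rows.

```javascript
// --- tiny harness simulating CouchDB's emit() for local testing ---
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: round each transaction down to its month and key by
// [year, month, country], so a grouped _count reduce collapses the
// view to one row per (month, country) pair.
function map(doc) {
  var d = new Date(doc.timestamp * 1000); // doc.timestamp is in seconds
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, doc.country_name], 1);
}

// List-function logic: collate the distinct countries out of the grouped
// rows. The country is the last element of each complex key.
function collate(viewRows) {
  var seen = {}, countries = [];
  viewRows.forEach(function (row) {
    var country = row.key[2];
    if (!seen[country]) { seen[country] = true; countries.push(country); }
  });
  return countries;
}

// Example, using the document from the original question:
map({ timestamp: 1332806400, country_name: "Australia" });
map({ timestamp: 1332806400, country_name: "Fiji" });
console.log(collate(rows)); // the distinct countries across the emitted rows
```

Because the reduce (`_count` with `group=true`) runs incrementally inside CouchDB, only the already-collapsed per-month rows reach the list function, which is what makes this fast.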
Basically we are now limited to querying on a per-month basis, but that's fine in our case. As for benchmarks, this approach proved to be very fast. Thanks, everyone!

On Tue, Apr 16, 2013 at 7:23 PM, muji <[email protected]> wrote:

> That depends upon your requirements and data. If the requirement is to find
> data across *any* date range then it will potentially be slow. However, if
> you only ever need to query with a maximum date range of, say, a year
> (i.e., your date ranges do not span more than a year), then you could use
> filtered replication to create databases containing only the entries for a
> specific year.
>
> Still not sure if that helps you.
>
> I am working on the same problem(s) with an analytics application that uses
> CouchDB, and luckily for me the client reports for any date range only need
> to be generated once. Realtime analysis is for *now* (today); otherwise, run
> a separate process to generate the report for the given date range and then
> subsequently return the (cached) generated report.
>
> I am not sure I completely understand your use case, but you may want to
> consider caching result sets for dates in the past, maybe in Redis? After
> all, once the date has expired the data is fixed, right? Or not?
>
> On 16 April 2013 11:50, Andrey Kuprianov <[email protected]> wrote:
>
> > Muji, what happens if you have several hundred transactions per day in a
> > variety of different countries over several years? Then your view
> > processing is going to be very slow. We are looking for a near real-time
> > solution.
> >
> > On Tue, Apr 16, 2013 at 5:42 PM, muji <[email protected]> wrote:
> >
> > > I believe you need to query with startkey and endkey as complex keys
> > > (assuming YYYY-MM-DD):
> > >
> > > startkey=[startyear,startmonth,startday]
> > > endkey=[endyear,endmonth,endday,{}]
> > >
> > > Then you can extract the countries from the key returned with each row
> > > (it will be the last element in the array).
> > > You will also need to set the group view parameter (group_level=4?) for
> > > distinct values.
> > >
> > > Then you should not need to write a custom reduce function.
> > >
> > > The startkey and endkey must be properly JSON- (and URL-) encoded values.
> > >
> > > My understanding is that this is the correct approach.
> > >
> > > Cheers!
> > >
> > > On 16 April 2013 05:46, Andrey Kuprianov <[email protected]> wrote:
> > >
> > > > Nope, I need distinct values over a period of time, not per day.
> > > >
> > > > On Tue, Apr 16, 2013 at 11:30 AM, Keith Gable <[email protected]> wrote:
> > > >
> > > > > It gives you distinct countries per day. Is that not what you want?
> > > > > With reduce, it should be really fast once the view is built.
> > > > >
> > > > > On Apr 15, 2013 9:05 PM, "Andrey Kuprianov" <[email protected]> wrote:
> > > > >
> > > > > > @Keith, your method will not give me distinct countries, and even
> > > > > > with reduce, after being fed to a list function, it's still slow.
> > > > > >
> > > > > > On Tue, Apr 16, 2013 at 2:27 AM, Wendall Cada <[email protected]> wrote:
> > > > > >
> > > > > > > I agree with this approach. I do something similar using _sum:
> > > > > > >
> > > > > > > emit([doc.country_name, toDay(doc.timestamp)], 1);
> > > > > > >
> > > > > > > The toDay() method is basically a floor of the day value. Since I
> > > > > > > don't store ts in UTC (because of an idiotic error some years
> > > > > > > back), I also do a tz offset to correct the day value in my
> > > > > > > toDay() method.
> > > > > > >
> > > > > > > Using reduce is by far the fastest method for this. I don't see
> > > > > > > any issue with getting this to scale.
> > > > > > > Overall, I think I rather prefer the method Keith shows, as it
> > > > > > > would depend on the values returned in the date object versus
> > > > > > > other, possibly inaccurate, means using math.
> > > > > > >
> > > > > > > Wendall
> > > > > > >
> > > > > > > On 04/15/2013 07:18 AM, Keith Gable wrote:
> > > > > > >
> > > > > > > > Output keys like so:
> > > > > > > >
> > > > > > > > [2010, 7, 10, "Australia"]
> > > > > > > >
> > > > > > > > The reduce function would be _count.
> > > > > > > >
> > > > > > > > startkey=[year,month,day,null]
> > > > > > > > endkey=[year,month,day,{}]
> > > > > > > >
> > > > > > > > ---
> > > > > > > > Keith Gable
> > > > > > > > A+, Network+, and Storage+ Certified Professional
> > > > > > > > Apple Certified Technical Coordinator
> > > > > > > > Mobile Application Developer / Web Developer
> > > > > > > >
> > > > > > > > On Sun, Apr 14, 2013 at 8:37 PM, Andrey Kuprianov <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi guys,
> > > > > > > > >
> > > > > > > > > Just for the sake of a debate, here's the question. There are
> > > > > > > > > transactions. Among their other attributes there is a
> > > > > > > > > timestamp (when the transaction was made, in seconds) and a
> > > > > > > > > country name (from where the transaction was made). For
> > > > > > > > > instance:
> > > > > > > > >
> > > > > > > > > {
> > > > > > > > >     . . . .
> > > > > > > > >     "timestamp": 1332806400,
> > > > > > > > >     "country_name": "Australia",
> > > > > > > > >     . . . .
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > The question is: how does one get unique / distinct country
> > > > > > > > > names between two dates? For example, give me all country
> > > > > > > > > names between 10-Jul-2010 and 21-Jan-2013.
> > > > > > > > > My solution was to write a custom reduce function and set
> > > > > > > > > reduce_limit=false, so that I could enumerate all countries
> > > > > > > > > without hitting the overflow exception. It works great!
> > > > > > > > > However, such solutions are frowned upon by everyone around.
> > > > > > > > > Does anyone have a better idea on how to tackle this
> > > > > > > > > efficiently?
> > > > > > > > >
> > > > > > > > > Andrey
> > >
> > > --
> > > mischa (aka muji).
>
> --
> mischa (aka muji).
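For reference, muji's complex-key suggestion from the thread above can be sketched as follows. The `emit()` harness, `rangeQuery()`, and `distinctCountries()` are illustrative helpers (not part of CouchDB's API): the map keys rows by [year, month, day, country], a `_count` reduce with `group_level=4` returns each distinct (day, country) combination once, and the client collates the distinct countries from the last key element.

```javascript
// --- tiny harness simulating CouchDB's emit() for local testing ---
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: key each transaction by [year, month, day, country].
function map(doc) {
  var d = new Date(doc.timestamp * 1000); // doc.timestamp is in seconds
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
        doc.country_name], 1);
}

// Build the view query string for a date range given as [y, m, d] triples.
// startkey/endkey must be JSON-encoded and then URL-encoded; the trailing {}
// sorts after every string in CouchDB's collation order, which closes the
// range on the end day.
function rangeQuery(start, end) {
  return [
    "startkey=" + encodeURIComponent(JSON.stringify(start)),
    "endkey=" + encodeURIComponent(JSON.stringify(end.concat([{}]))),
    "group_level=4"
  ].join("&");
}

// The distinct countries are the set of last key elements in the rows.
function distinctCountries(viewRows) {
  var seen = {};
  return viewRows
    .map(function (r) { return r.key[r.key.length - 1]; })
    .filter(function (c) { return seen[c] ? false : (seen[c] = true); });
}

map({ timestamp: 1332806400, country_name: "Australia" });
console.log(rangeQuery([2010, 7, 10], [2013, 1, 21]));
```

As Andrey notes in the thread, the trade-off is that the grouped rows still scale with days × countries in the range, so the client-side collation grows with the range; the month-granularity variant the OP settled on keeps that row count small.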
