Re: Options for Iterative Map Reduce

James Marca Thu, 13 Dec 2012 16:44:48 -0800

On Wed, Dec 12, 2012 at 05:07:08PM -0500, nicholas a. evans wrote:
> On Wed, Dec 12, 2012 at 4:03 PM, James Marca
> <[email protected]> wrote:
> > I feel your pain but cannot offer any help.  I also use your option 5:
> > I use node.js to manually store view output into a separate db, with
> > the doc _ids equal to the key of the view output, so that I can limit
> > updates to only those things that change.
> 
> Thanks James.  Do you apply the changes incrementally, and if so how
> do you detect which view rows (in the source DB) have changes so you
> don't need to download the whole reduced/grouped view?  And to the
> point of my last email, how do you detect missing view rows in the
> source DB and delete them from the chained DB?


My app is somewhat of a special case, which is why I was willing to
roll my own.  I am averaging multiple imputation runs to generate a
single estimate of a parameter.  So when "something changes", yes, I
have to re-collect all the view outputs because that means I cranked
up the imputation engine and re-ran more estimates.
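
For anyone following along, the copy step mentioned earlier in the thread (doc _ids equal to the view keys, writing only what changed) might look roughly like the sketch below. It is not my actual code: the function name rowsToDocs, the row and doc shapes, and the choice of JSON-encoding the key as the _id are all assumptions made for illustration.

```javascript
// Sketch only: rowsToDocs and the row/doc shapes are hypothetical.
// rows: [{ key, value }] as returned by a grouped view query.
// existing: Map from doc _id to the doc currently stored in the collated db.
function rowsToDocs(rows, existing) {
  const updates = [];
  for (const row of rows) {
    // Assumption: the JSON-encoded view key serves as the doc _id.
    const id = JSON.stringify(row.key);
    const old = existing.get(id);
    // Skip rows whose stored value already matches the view output.
    // (String comparison is sensitive to object key order.)
    if (old && JSON.stringify(old.value) === JSON.stringify(row.value)) {
      continue;
    }
    const doc = { _id: id, value: row.value };
    // Carry the current revision forward so the update is not a conflict.
    if (old) doc._rev = old._rev;
    updates.push(doc);
  }
  return updates;
}
```

The returned array is meant to be sent as the "docs" member of a _bulk_docs request against the collated db.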

I get new data in yearly and monthly batches, and the data is large so
I tend to store my imputation outputs in one database per county per
year.  This hand-sharding means that when adding new data I don't have
to worry about avoiding old data in the view collation step.

If I'm just running a new month of data (hypothetical, I've only done
years at a time so far), then I set the view parameters to get from
the start time to the end time (my view keys are the detector id and
the hour the data was collected) to limit the work I have to do.
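
As a rough illustration of limiting the view query that way, here is a sketch that builds the query string for one detector over a time window. The function name viewRangeQuery, the hour string format, and the use of group=true are assumptions for the example; startkey, endkey and group are the standard CouchDB view query parameters, and their values must be JSON-encoded.

```javascript
// Sketch only: viewRangeQuery and the hour format are hypothetical.
// Assumes view keys of the form [detectorId, hour], as described above.
function viewRangeQuery(detectorId, startHour, endHour) {
  const params = new URLSearchParams({
    // CouchDB expects JSON-encoded keys in the query string.
    startkey: JSON.stringify([detectorId, startHour]),
    endkey: JSON.stringify([detectorId, endHour]),
    // Assumption: the view has a reduce and we want one row per key.
    group: "true",
  });
  return params.toString();
}
```

The result would be appended after a "?" to the view URL, once per detector, since the detector id is the first element of the key.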

Finally, given that my application is gathering real world
measurements, I never delete things, so worrying about rows in the
collated db that should go away is also not an issue for me.

If I had to do that, I would create a view on my collation db that
spits out some unique key (detector, day, for example) I can check
against the source databases.  Or else, I'd set a field in the
collated document, say the date stamp of the update or something, and
then have a view that spits out docs sorted by that field, and run a
background job to reap the ones that are out of date.
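
A minimal sketch of the first idea (checking collated docs against the keys still present in the source) could look like the following. I have not run this against my own data. The function name staleDeletions and the assumption that each collated _id is the JSON-encoded source key are purely illustrative; the _deleted: true stubs are what CouchDB's _bulk_docs accepts for deletions.

```javascript
// Sketch only: staleDeletions and the _id convention are hypothetical.
// collatedDocs: [{ _id, _rev }] read from the collated db.
// sourceKeys: view keys that still exist in the source databases.
function staleDeletions(collatedDocs, sourceKeys) {
  // Assumption: each collated _id is the JSON-encoded source view key.
  const live = new Set(sourceKeys.map((key) => JSON.stringify(key)));
  return collatedDocs
    // Keep only docs whose key no longer appears in the source.
    .filter((doc) => !live.has(doc._id))
    // A deletion stub needs the _id, the current _rev, and _deleted.
    .map((doc) => ({ _id: doc._id, _rev: doc._rev, _deleted: true }));
}
```

The returned array would go in the "docs" member of a _bulk_docs request to the collated db.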

Sorry, that is probably totally unhelpful.

Regards,
James Marca
