On 10/14/15 12:55 PM, jason kirtland wrote:
> If you can partition the rows numerically, this is trivially easy to
> implement using redis as the orchestrator.
> 
> For example if you have integer PKs, you might have a loop like:
> 
>     offset = 0
>     while offset < tablesize:
>         for row in query[offset:offset + batchsize]:
>             migrate(row)
>         commit()
>         offset += batchsize
> 
> With redis orchestrating, you use a key in redis and INCRBY to reliably
> distribute batches to an arbitrary number of workers on an arbitrary
> number of hosts.
> 
>    while True:
>        offset = redis.incrby('migration-offset', batchsize)
>        rows = query[offset:offset + batchsize]
>        if not rows:
>            break
>        for row in rows:
>            migrate(row)
>        commit()
> 
> INCRBY is atomic and returns the adjusted value, so every invocation of
> this script that calls into redis and INCRBYs by, say, 1000, has its own
> chunk of 1000 to work on. For a starting value of -1000 and four
> invocations, you'd see 0, 1000, 2000 and 3000.
> 
> I'll typically do this on one invocation, see that it's running well and
> that I chose a performant batch size, and then spin up additional
> workers on more cores until the migration hits the overall throughput
> required.
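(For reference, the INCRBY handoff described above can be run without a live
redis server by standing in a plain counter for the 'migration-offset' key --
all names and row counts below are illustrative, and real redis performs the
increment atomically server-side:)

```python
class FakeRedis:
    """Single-process stand-in for redis; real INCRBY is atomic on the server."""
    def __init__(self):
        self.keys = {}

    def incrby(self, key, amount):
        self.keys[key] = self.keys.get(key, 0) + amount
        return self.keys[key]

batchsize = 1000
rows = list(range(3500))          # stand-in for the real query's rows

redis = FakeRedis()
redis.incrby('migration-offset', -batchsize)   # seed so the first claim is offset 0

claimed = []
while True:
    offset = redis.incrby('migration-offset', batchsize)
    batch = rows[offset:offset + batchsize]
    if not batch:
        break
    claimed.append((offset, len(batch)))       # migrate(row) + commit() go here

print(claimed)   # [(0, 1000), (1000, 1000), (2000, 1000), (3000, 500)]
```

Every worker that runs this loop claims a disjoint chunk, so adding workers on
more hosts needs no further coordination.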

What am I missing that one wouldn't use, say, multiprocessing.Pool() for
this kind of thing in the general case?  If we're only talking about 5-10
runners, they could just as well be local forked processes.



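The "every 5th record" split from the original question amounts to a modulo
filter on the primary key, sketched below with made-up names; note that a
`pk % N` predicate generally can't use the PK index and forces a scan, which
is one argument for contiguous offset batches instead:

```python
N_RUNNERS = 5

# Each runner claims the stripe of primary keys where pk % N_RUNNERS equals
# its index; with SQLAlchemy that'd be roughly (names illustrative):
#     session.query(Obj).filter(Obj.id % N_RUNNERS == worker_index)

ids = range(1, 21)   # pretend primary keys
stripes = {w: [i for i in ids if i % N_RUNNERS == w] for w in range(N_RUNNERS)}

# The stripes are disjoint and together cover every row exactly once.
covered = sorted(i for stripe in stripes.values() for i in stripe)
print(covered == list(range(1, 21)))   # True
```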
> 
> 
> 
> On Wed, Oct 14, 2015 at 9:32 AM, Jonathan Vanasco <[email protected]> wrote:
> 
>     I have to run a script on 2MM objects to update the database.  Not
>     really a schema migration, more like changing the internal data
>     representation in the fields.
> 
>     There's a bit of post-processing and bottlenecks involved, so doing
>     everything one-at-a-time will take a few days.
> 
>     I'd like to split this out into 5-10 'task runners' that are each
>     responsible for a section of the database (i.e., every 5th record).
>     That should considerably drop the runtime.
> 
>     I thought I had seen a recipe for this somewhere, but checked and
>     couldn't find anything.  That leads me to question if this is a good
>     idea or not.  Anyone have thoughts/pointers?
> 
>     -- 
>     You received this message because you are subscribed to the Google
>     Groups "sqlalchemy" group.
>     To unsubscribe from this group and stop receiving emails from it,
>     send an email to [email protected]
>     <mailto:[email protected]>.
>     To post to this group, send email to [email protected]
>     <mailto:[email protected]>.
>     Visit this group at http://groups.google.com/group/sqlalchemy.
>     For more options, visit https://groups.google.com/d/optout.
> 
> 