Ah, so simple — thanks Jeremy! That's working perfectly and is about 100 times faster than the one-row-at-a-time method.

Best,
Rob

On 20 Feb 2015, at 18:30, Jeremy Evans wrote:

On Friday, February 20, 2015 at 8:33:50 AM UTC-8, Rob Miller wrote:

Hello,

I'm performing an ETL task part of which involves taking data from a table in one database and importing it, with a mild transformation—just some
de-normalisation really—into a new table in a reporting database.

If these tables were in the same database, I'd do something like:

first_table = DB[:first_table].select(:foo, :bar, :whatever).join(:etc)
DB.create_table :second_table, as: first_table

This results in a single query which is naturally relatively fast.

I can't do a single query for my cross-database example, because that
would involve having the entire dataset in memory and we're talking about
~40 million rows.

The first and obvious solution was to use paged_each and just iterate
over the result set, inserting one row at a time:

DB.create_table :second_table do
  # snip
end

DB[:first_table].select(:foo, :bar, :whatever).paged_each do |row|
  DB[:second_table].insert(...)
end

This works fine, but naturally is much slower since I'm inserting only one
row at a time.

Is there some way to improve this? My instinct would be to chunk things
while still streaming (like Ruby's each_slice), so I could iterate over,
say, 1,000 records at a time and use import to load them into the second
table. But I'm not sure how to do that with (or without) paged_each, and
I'm not even sure if that's the right approach.
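(For what it's worth, the chunk-while-streaming idea can be sketched in plain Ruby with a lazy Enumerator standing in for the streamed result set; each_slice pulls items in groups without materialising the whole set:)

```ruby
# Sketch only, no Sequel involved: an Enumerator stands in for rows
# arriving one at a time from the database.
rows = Enumerator.new do |y|
  1.upto(10) { |i| y << { id: i } }
end

# each_slice groups the stream into fixed-size batches lazily.
batch_sizes = []
rows.each_slice(4) { |slice| batch_sizes << slice.size }
batch_sizes  # => [4, 4, 2]
```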

Thoughts appreciated!

Many thanks,
Rob

You can do that with paged_each:

DB[:first_table].select(:foo, :bar, :whatever)
  .enum_for(:paged_each)
  .each_slice(1000) { |slice| OTHER_DB[:second_table].multi_insert(slice) }

Note that this may not work on some adapters, as some adapters do not
support issuing a new query while iterating over the results of a previous
query.  One easy workaround is to use a new thread:

DB[:first_table].select(:foo, :bar, :whatever)
  .enum_for(:paged_each)
  .each_slice(1000) do |slice|
    Thread.new { OTHER_DB[:second_table].multi_insert(slice) }.join
  end

This works because a new thread will automatically use a new connection. There are other ways if you don't want to use a new thread, such as using
the sharding support.

Thanks,
Jeremy

--
You received this message because you are subscribed to the Google Groups "sequel-talk" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/sequel-talk.
For more options, visit https://groups.google.com/d/optout.

