Ah, so simple — thanks Jeremy! That's working perfectly and is about
100 times faster than the one-row-at-a-time method.
Best,
Rob
On 20 Feb 2015, at 18:30, Jeremy Evans wrote:
On Friday, February 20, 2015 at 8:33:50 AM UTC-8, Rob Miller wrote:
Hello,
I'm performing an ETL task, part of which involves taking data from a table in one database and importing it, with a mild transformation—just some de-normalisation really—into a new table in a reporting database.
If these tables were in the same database, I'd do something like:
first_table = DB[:first_table].select(:foo, :bar, :whatever).join(:etc)
DB.create_table :second_table, as: first_table
This results in a single query which is naturally relatively fast.
I can't do a single query for my cross-database example, because that would involve having the entire dataset in memory, and we're talking about ~40 million rows.
The first and obvious solution was to use paged_each and just iterate over the result set, inserting each time:
DB.create_table :second_table do
  # snip
end

DB[:first_table].select(:foo, :bar, :whatever).paged_each do |row|
  DB[:second_table].insert(...)
end
This works fine, but naturally is much slower, since I'm inserting only one row at a time.
Is there some way to improve this? My instinct would be to chunk things while still streaming (like Ruby's each_slice), so I could iterate over, say, 1,000 records at a time and use import to import them to the second table, but I'm not sure how to do that with (or without) paged_each. But I'm not even sure if that's the right approach.
Thoughts appreciated!
Many thanks,
Rob
You can do that with paged_each:
DB[:first_table].select(:foo, :bar, :whatever).enum_for(:paged_each).
  each_slice(1000){|slice| OTHER_DB[:second_table].multi_insert(slice)}
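(Side note on the mechanics: enum_for wraps paged_each in a lazy Enumerator, and each_slice does the batching. Stripped of Sequel, the same pattern looks like this; the row hashes here are dummy data standing in for what paged_each would stream:)

```ruby
# Dummy rows, in place of the rows paged_each would stream from the table.
rows = (1..10).map { |i| { foo: i, bar: i * 2, whatever: nil } }

# each_slice groups the stream into batches of the requested size; each
# batch is an array of row hashes, exactly what multi_insert accepts.
batches = rows.each_slice(4).to_a

batches.length        # => 3  (4 + 4 + 2 rows)
batches.last.length   # => 2
```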
Note that this may not work on some adapters, as some adapters do not support issuing a new query while iterating over the results of a previous query. One easy workaround is to use a new thread:
DB[:first_table].select(:foo, :bar, :whatever).enum_for(:paged_each).
  each_slice(1000){|slice| Thread.new{OTHER_DB[:second_table].multi_insert(slice)}.join}
This works because a new thread will automatically use a new connection.
There are other ways if you don't want to use a new thread, such as using the sharding support.
Thanks,
Jeremy
--
You received this message because you are subscribed to the Google
Groups "sequel-talk" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/sequel-talk.
For more options, visit https://groups.google.com/d/optout.