On Friday, February 20, 2015 at 8:33:50 AM UTC-8, Rob Miller wrote:
>
> Hello,
>
> I'm performing an ETL task part of which involves taking data from a table
> in one database and importing it, with a mild transformation—just some
> de-normalisation really—into a new table in a reporting database.
>
> If these tables were in the same database, I'd do something like:
>
> first_table = DB[:first_table].select(:foo, :bar, :whatever).join(:etc)
> DB.create_table :second_table, as: first_table
>
> This results in a single query which is naturally relatively fast.
>
> I can't do a single query for my cross-database example, because that
> would involve having the entire dataset in memory and we're talking about
> ~40 million rows.
>
> The first and obvious solution was to use paged_each and just iterate
> over the result set, inserting each time:
>
> DB.create_table :second_table do
> # snip
> end
>
> DB[:first_table].select(:foo, :bar, :whatever).paged_each do |row|
>   DB[:second_table].insert(...)
> end
>
> This works fine, but naturally is much slower since I'm inserting only one
> row at a time.
>
> Is there some way to improve this? My instinct would be to chunk things
> while still streaming (like Ruby's each_slice), so I could iterate over,
> say, 1,000 records at a time and use import to import them to the second
> table, but I'm not sure how to do that with (or without) paged_each. But
> I'm not even sure if that's the right approach.
>
> Thoughts appreciated!
>
> Many thanks,
> Rob
>
You can do that with paged_each:

DB[:first_table].select(:foo, :bar, :whatever).enum_for(:paged_each).each_slice(1000) do |slice|
  OTHER_DB[:second_table].multi_insert(slice)
end
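To make the chunking mechanics concrete, here is a database-free sketch of the enum_for(:paged_each) + each_slice pattern. FakeDataset is a hypothetical stand-in for a Sequel dataset: it yields rows one at a time the way paged_each does, while each_slice groups them into fixed-size batches.

```ruby
# FakeDataset is a made-up stand-in for a Sequel dataset, used only to
# show the enum_for + each_slice shape without a real database.
class FakeDataset
  def initialize(rows)
    @rows = rows
  end

  # Yields rows one at a time, like Sequel's paged_each.
  def paged_each(&block)
    @rows.each(&block)
  end
end

rows = (1..10).map { |i| { id: i } }
batches = []

# enum_for wraps paged_each in an Enumerator, so each_slice can pull
# rows lazily and hand them over in groups of 4.
FakeDataset.new(rows).enum_for(:paged_each).each_slice(4) do |slice|
  # In the real task this is where OTHER_DB[:second_table].multi_insert(slice) goes.
  batches << slice.map { |row| row[:id] }
end

batches  # => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

The point of enum_for here is that paged_each only takes a block; wrapping it in an Enumerator is what lets you chain each_slice onto it.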
Note that this may not work on some adapters, since not all adapters
support issuing a new query while iterating over the results of a previous
query. One easy workaround is to use a new thread:
DB[:first_table].select(:foo, :bar, :whatever).enum_for(:paged_each).each_slice(1000) do |slice|
  Thread.new { OTHER_DB[:second_table].multi_insert(slice) }.join
end
This works because a new thread will automatically use a new connection.
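Illustrated without a database, the thread-per-batch shape looks like the sketch below: each slice is handled in a fresh thread (which in Sequel checks out its own connection from the pool), and .join blocks until that batch finishes before the next one starts, so batches never overlap.

```ruby
# Database-free sketch of the thread-per-batch workaround. The thread
# body stands in for OTHER_DB[:second_table].multi_insert(slice).
processed = []

(1..10).each_slice(4) do |slice|
  # .join serializes the batches: the loop waits for each thread to
  # finish before pulling the next slice from the source.
  Thread.new { processed.concat(slice) }.join
end

processed  # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```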
There are other ways if you don't want to use a new thread, such as using
the sharding support.
Thanks,
Jeremy
--
You received this message because you are subscribed to the Google Groups
"sequel-talk" group.
Visit this group at http://groups.google.com/group/sequel-talk.