Hello,
I'm performing an ETL task, part of which involves taking data from a
table in one database and importing it, with a mild transformation
(just some de-normalisation, really), into a new table in a reporting
database.
If these tables were in the same database, I'd do something like:
first_table = DB[:first_table].select(:foo, :bar, :whatever).join(:etc)
DB.create_table :second_table, as: first_table
This results in a single query, which is naturally relatively fast.
I can't do a single query for my cross-database example, because that
would involve having the entire dataset in memory and we're talking
about ~40 million rows.
The first and obvious solution was to use `paged_each` and just iterate
over the result set, inserting each time:
DB.create_table :second_table do
# snip
end
DB[:first_table].select(:foo, :bar, :whatever).paged_each do |row|
  DB[:second_table].insert(...)
end
This works fine, but naturally is much slower since I'm inserting only
one row at a time.
Is there some way to improve this? My instinct would be to chunk things
while still streaming (like Ruby's `each_slice`), so I could iterate
over, say, 1,000 records at a time and use `import` to import them to
the second table, but I'm not sure how to do that with (or without)
`paged_each`. But I'm not even sure if that's the right approach.
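To make the idea concrete, here's a rough sketch of the batching logic I have in mind. The `Enumerator` below is just a stand-in for the streamed result set, so the chunking can be shown self-contained; I believe `paged_each` returns an `Enumerator` when called without a block, in which case `each_slice` should compose with it directly, as noted in the comments (please correct me if that's wrong):

```ruby
BATCH_SIZE = 1_000

# Stand-in for the streamed result set. With real Sequel this would be
# something like:
#   rows = DB[:first_table].select(:foo, :bar, :whatever).paged_each
# (assuming paged_each with no block returns an Enumerator).
rows = Enumerator.new do |y|
  2_500.times { |i| y << { foo: i, bar: i * 2, whatever: "x#{i}" } }
end

batches = 0
rows.each_slice(BATCH_SIZE) do |batch|
  # With real Sequel, bulk-insert the slice here, e.g.:
  #   DB[:second_table].multi_insert(batch)
  batches += 1
end

puts batches  # => 3 (two full slices of 1,000 rows, then one of 500)
```

That way each round trip to the reporting database carries 1,000 rows instead of one, while never holding more than one slice in memory.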
Thoughts appreciated!
Many thanks,
Rob
--
You received this message because you are subscribed to the Google Groups
"sequel-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/sequel-talk.
For more options, visit https://groups.google.com/d/optout.