Hello,

I'm performing an ETL task, part of which involves taking data from a table in one database and importing it, with a mild transformation (just some de-normalisation, really), into a new table in a reporting database.

If these tables were in the same database, I'd do something like:

        first_table = DB[:first_table].select(:foo, :bar, :whatever).join(:etc)
        DB.create_table :second_table, as: first_table

This results in a single query which is naturally relatively fast.

I can't do a single query for my cross-database example, because that would involve having the entire dataset in memory and we're talking about ~40 million rows.

The first and obvious solution was to use `paged_each` and just iterate over the result set, inserting one row at a time:

        DB.create_table :second_table do
                # snip
        end

        DB[:first_table].select(:foo, :bar, :whatever).paged_each do |row|
                DB[:second_table].insert(row)
        end

This works, but it's naturally much slower, since I'm inserting only one row at a time.

Is there some way to improve this? My instinct would be to chunk things while still streaming (like Ruby's `each_slice`): iterate over, say, 1,000 records at a time and use `import` to load them into the second table. But I'm not sure how to do that with (or without) `paged_each`, or whether that's even the right approach.

Thoughts appreciated!

Many thanks,
Rob

--
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.