On Friday, February 20, 2015 at 8:33:50 AM UTC-8, Rob Miller wrote:
>
> Hello,
>
> I'm performing an ETL task part of which involves taking data from a table
> in one database and importing it, with a mild transformation—just some
> de-normalisation really—into a new table in a reporting database.
>
> If these tables were in the same database, I'd do something like:
>
> first_table = DB[:first_table].select(:foo, :bar, :whatever).join(:etc)
> DB.create_table :second_table, as: first_table
>
> This results in a single query which is naturally relatively fast.
>
> I can't do a single query for my cross-database example, because that
> would involve having the entire dataset in memory and we're talking about
> ~40 million rows.
>
> The first and obvious solution was to use paged_each and just iterate
> over the result set, inserting each time:
>
> DB.create_table :second_table do
> # snip
> end
>
> DB[:first_table].select(:foo, :bar, :whatever).paged_each do |row|
>   DB[:second_table].insert(...)
> end
>
> This works fine, but naturally is much slower since I'm inserting only one
> row at a time.
>
> Is there some way to improve this? My instinct would be to chunk things
> while still streaming (like Ruby's each_slice), so I could iterate over,
> say, 1,000 records at a time and use import to import them to the second
> table, but I'm not sure how to do that with (or without) paged_each. But
> I'm not even sure if that's the right approach.
>
> Thoughts appreciated!
>
> Many thanks,
> Rob
>
You can do that with paged_each:

DB[:first_table].select(:foo, :bar, :whatever).enum_for(:paged_each).each_slice(1000) do |slice|
  OTHER_DB[:second_table].multi_insert(slice)
end
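To make the chunking mechanics concrete, here is a database-free sketch of the enum_for(:paged_each) + each_slice pattern. FakeDataset is a hypothetical stand-in for a Sequel dataset: it yields rows one at a time the way paged_each does, while each_slice groups them into fixed-size batches.

```ruby
# FakeDataset is a made-up stand-in for a Sequel dataset, used only to
# show the enum_for + each_slice shape without a real database.
class FakeDataset
  def initialize(rows)
    @rows = rows
  end

  # Yields rows one at a time, like Sequel's paged_each.
  def paged_each(&block)
    @rows.each(&block)
  end
end

rows = (1..10).map { |i| { id: i } }
batches = []

# enum_for wraps paged_each in an Enumerator, so each_slice can pull
# rows lazily and hand them over in groups of 4.
FakeDataset.new(rows).enum_for(:paged_each).each_slice(4) do |slice|
  # In the real task this is where OTHER_DB[:second_table].multi_insert(slice) goes.
  batches << slice.map { |row| row[:id] }
end

batches  # => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

The point of enum_for here is that paged_each only takes a block; wrapping it in an Enumerator is what lets you chain each_slice onto it.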
Note that this may not work on some adapters, since not all adapters
support issuing a new query while iterating over the results of a previous
query. One easy workaround is to use a new thread:
DB[:first_table].select(:foo, :bar, :whatever).enum_for(:paged_each).each_slice(1000) do |slice|
  Thread.new { OTHER_DB[:second_table].multi_insert(slice) }.join
end
This works because a new thread will automatically use a new connection.
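Illustrated without a database, the thread-per-batch shape looks like the sketch below: each slice is handled in a fresh thread (which in Sequel checks out its own connection from the pool), and .join blocks until that batch finishes before the next one starts, so batches never overlap.

```ruby
# Database-free sketch of the thread-per-batch workaround. The thread
# body stands in for OTHER_DB[:second_table].multi_insert(slice).
processed = []

(1..10).each_slice(4) do |slice|
  # .join serializes the batches: the loop waits for each thread to
  # finish before pulling the next slice from the source.
  Thread.new { processed.concat(slice) }.join
end

processed  # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```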
There are other ways if you don't want to use a new thread, such as using
the sharding support.
Thanks,
Jeremy
--
You received this message because you are subscribed to the Google Groups
"sequel-talk" group.
Visit this group at http://groups.google.com/group/sequel-talk.