After reading some more source and docs, I don't think Sequel's
:commit_every or :slice options are going to help me out.

I'm thinking I'm going to need two models for the table I'm writing to:
one for bulk inserts and one for individual inserts,
with appropriate after_commit and after_rollback hooks.
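
Roughly what I have in mind (the model names are placeholders and the
hook body is just a stub):

class BulkStat < Sequel::Model(:stats)
  # bulk path: rows go in via BulkStat.dataset.import, no per-row hooks
end

class SingleStat < Sequel::Model(:stats)
  # one-by-one fallback path for rows that failed a bulk insert
  def after_rollback
    super
    # stash this row's values somewhere so the failure can be reported
  end
end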

Again, feel free to shoot me down if I'm on the wrong path!

On Sunday, January 6, 2013 10:33:32 AM UTC-6, Brandt Lofton wrote:
>
> Well, the approach is already working, interfacing directly with ruby-oci8.
> I'm in more of a "refactor all the things" mode right now.
>
> One of the biggest problems for me is that my test db is an Oracle
> instance over a slow VPN, connecting to an underpowered / busy database in a
> lab god knows where. This gem/app is only ~700 lines of code and the test
> suite takes > 2 mins to run :(  So I have a couple of different reasons for
> looking into Sequel.  One of the biggies is that I'm hoping I can test
> against a local MySQL db with the same schema 95% of the time, and then
> only test against Oracle when I need to verify portability.
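>
> e.g. I'm picturing something like this at connection time (both
> connection URLs are made up):
>
> DB = if ENV['ORACLE_TESTS']
>   Sequel.connect('oracle://user:pass@labhost/XE')
> else
>   Sequel.connect('mysql2://root@localhost/myapp_test')
> end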
>
> So I guess I should have been more specific.  I'm looking for a best
> practice, as in: how would one of you guys write this code block?  This is
> what I'm doing with ruby-oci8 directly -
> https://gist.github.com/5be872c941c6e6a0ad28 . The piece of code that
> calls this bulk insert slices the array first.  So basically:
> @all_the_things.each_slice(500) { |s| bulk_insert(s); these_failed(s) if
> things_go_boom }
>
> The biggest hang-up, on top of all the other stupid things I have to
> deal with, is that I need to batch insert, but I also HAVE to know which
> records failed and store info on those.  The latter is what makes this a
> pain.
>
> For instance, if I bulk_insert(100,000) records and then commit, it
> behaves in a sorta nice way in that it will return an error, a row count of
> how many were successfully inserted, and "discard" the unsuccessful ones.
> By "discard", I mean it provides no information about which elements of the
> array/records failed.
>
> This alone should be enough, just throwing away any records that
> failed. But nooooooo.  The clients sending the data in are so fubar'd they
> consistently send duplicate data.  So this not only causes issues with the
> transactional db, but also when the data gets moved to the warehouse.
>
> So basically, I'm wondering if I could refactor out a ton of code
> with Sequel in my tool belt.  I'm looking at this right now -
> http://sequel.rubyforge.org/rdoc/classes/Sequel/Database.html#method-i-after_commit
> and after_rollback.
>
> So given @stats = 50,000 "rows to be inserted"
>
> How can I accomplish this / can this even be done?
>
> DB[:stats].import(cols, @stats, :commit_every => 500)
>
> While at the same time:
>
> after_commit do |slice_of_500???|
> after_rollback do |something_else|
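>
> Maybe something like this, with a transaction per slice (these_failed
> is my own helper)?
>
> @stats.each_slice(500) do |slice|
>   begin
>     DB.transaction do
>       # the block captures slice, so the hook doesn't need an argument
>       DB.after_rollback { these_failed(slice) }
>       DB[:stats].import(cols, slice)
>     end
>   rescue Sequel::DatabaseError
>     # the transaction already rolled back and the hook ran; keep going
>   end
> end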
>
> On Saturday, January 5, 2013 9:59:23 PM UTC-6, Jeremy Evans wrote:
>>
>> On Friday, January 4, 2013 7:48:43 PM UTC-8, Brandt Lofton wrote: 
>> > Alright, so I'm new to Sequel and I'm enjoying it quite a bit after 
>> getting over having to make a connection before defining a model class. 
>> > 
>> > 
>> > So here is the logic I need to use. 
>> > 
>> > 
>> > Collect x rows of data, x = ~30k usually 
>> > Loop through these records in batches of say, ~500 
>> > Model.import(those 500 records)
>> > Import fail? Move these bad 500 to a low priority queue so I can loop
>> through them 1 by 1 (rough sketch below)
>> > Do another 500
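>> >
>> > A rough sketch of that fallback (the table name and log_bad_record
>> > are placeholders):
>> >
>> > def retry_individually(cols, rows)
>> >   rows.each do |row|
>> >     begin
>> >       DB[:stats].insert(Hash[cols.zip(row)])
>> >     rescue Sequel::DatabaseError => e
>> >       log_bad_record(row, e)
>> >     end
>> >   end
>> > end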
>> > I know what you're asking... Why do you have to do this?  Because I
>> didn't design the part that the data comes from, or the database it ends up
>> in.
>> > I cannot "detect" this bad data because it is violating duplicate key 
>> constraints on the db side.  So I don't know if its bad until the database 
>> tells me so. 
>> > 
>> > 
>> > So anyways, I have been playing with the options Sequel has available
>> (DB.transaction, Model.import(:slice => 1000), etc.) and trying to figure
>> out what's going to do the trick.
>> > 
>> > 
>> > But for once I'm going to take a step back and ask for some advice
>> first.  Anybody more experienced in this regard have any best practice
>> advice for me before I go much further? I'm over my wtf quota for today...
>>
>> Your approach sounds reasonable.  If you are using MySQL, you might want 
>> to use insert_ignore, which will skip duplicate records automatically. 
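>>
>> For example, something like:
>>
>>   DB[:stats].insert_ignore.import(cols, @stats)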
>>
>> Thanks, 
>> Jeremy 
>>
>
