On Monday, April 10, 2017 at 2:55:09 PM UTC-7, Andriy Tyurnikov wrote:
>
> Hello everyone, and thank you so much for your effort with sequel.
>
> While benchmarking in-memory processing of big datasets (100K - 200K rows)
> in active-record and sequel
> we've noticed surprisingly good results of sql_query gem, which does
> nothing but invoking ActiveRecord::Base.connection.execute(SQL).entries,
> which suggests that the deserialization process is suboptimal (for big
> datasets at least) in both sequel and activerecord, which is fairly
> surprising, since creating 1_000_000 ruby objects doesn't seem that
> expensive (even with the exception of Date.new);
>
> With increase of resulting dataset
> "ActiveRecord::Base.connection.execute(SQL).entries" demonstrates fairly
> small cost of results processing, while both ORMs degrade, when used for
> resulting object instantiation:
>
> "
>
> Benchmark.measure {DB[:orders].limit(100000).map{|i| i[:id]}}
>
> D, [2017-04-11T00:07:11.743980 #38269] DEBUG -- : (1.468975s) SELECT *
> FROM "orders" LIMIT 100000
>
> => #<Benchmark::Tms:0x007fc061756768 @label="", @real=13.32571900000039,
> @cstime=0.0, @cutime=0.0, @stime=0.47000000000000597,
> @utime=11.819999999999993, @total=12.29>
> "
>
> 1) With that in mind - could someone please express an opinion on reasons
> of potential performance loss in such case?
>
Creating objects is one of the more expensive things you can do in Ruby,
and creating either Sequel::Model or ActiveRecord instances gets expensive
when you do it for hundreds of thousands of rows.
In Sequel, the most similar code to the
ActiveRecord::Base.connection.execute call would be:
DB.synchronize{|conn| conn.execute(query)} # assuming the underlying
connection supports an #execute method
However, it isn't exactly the same, as ActiveRecord generally abstracts
the connection object, whereas Sequel uses the raw connection object
provided by the driver (in most adapters).
One of the reasons that Sequel tends to be faster than ActiveRecord when
retrieving objects is that it does less work when creating instances.
However, it's still going to be slower than working with the driver
directly, as it has to:
1) build symbol keyed hashes for each row (1 hash per row)
2) do typecasting of values (if the driver doesn't do that) (potentially
1 or more objects per row per column)
3) wrap each hash in a Sequel::Model instance (if using Sequel::Model) (1
object per row)
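The relative cost of those three steps can be sketched in plain Ruby
with no database involved; the string-valued ROWS array and FakeModel
below are stand-ins for what a driver might hand back and what a model
class does, not Sequel internals:

```ruby
require 'benchmark'
require 'date'

# Stand-in for raw driver output: 100k rows of string values
ROWS = Array.new(100_000) { |i| [i.to_s, '2017-04-10'] }
COLUMNS = [:id, :created_at]

# Step 1: build a symbol-keyed hash for each row (1 hash per row)
hashes = nil
t1 = Benchmark.realtime do
  hashes = ROWS.map { |r| COLUMNS.zip(r).to_h }
end

# Step 2: typecast values (1 or more objects per row per column)
t2 = Benchmark.realtime do
  hashes.each do |h|
    h[:id] = Integer(h[:id])
    h[:created_at] = Date.parse(h[:created_at])
  end
end

# Step 3: wrap each hash in a model-like instance (1 object per row)
FakeModel = Struct.new(:values)
models = nil
t3 = Benchmark.realtime do
  models = hashes.map { |h| FakeModel.new(h) }
end

puts format('hashes: %.3fs  typecast: %.3fs  wrap: %.3fs', t1, t2, t3)
```

Typecasting tends to dominate here because Date.parse allocates several
intermediate objects per value, which matches the Date.new observation
in the original message.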
For the fastest possible code, use DB.synchronize to get access to the
connection object directly, and/or drop down to using C.
If you are using Sequel with the pg driver, you probably also want to load
sequel_pg (a C extension that significantly speeds things up).
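Loading sequel_pg is typically just a Gemfile change (assuming you are
on the postgres adapter backed by the pg driver):

```ruby
# Gemfile
gem 'pg'
gem 'sequel_pg', require: 'sequel' # loads sequel and swaps in C row fetching
```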
> 2) While jeremyevans/simple_orm_benchmark is a fairly good illustration
> of 'sequel' superiority, I wonder if anyone
> explored ORM performance in terms of detailed cost of
> networking/parsing/result object allocation?
>
I certainly would be interested in such an analysis.
Thanks,
Jeremy
--
You received this message because you are subscribed to the Google Groups
"sequel-talk" group.