Re: Question about Dataset#columns implementation / performance

Roland Swingler Thu, 16 Sep 2010 15:20:47 -0700

> What database are you using?

MySQL - the bane of your life ;)


> It currently does an unfiltered,
> unordered, limited to 1 select, which should be fastest on most
> databases For certain
> databases that don't have a syntax for LIMIT and need to be emulated
> with subselects, it's possible performance is bad if they don't have a
> good query optimizer.

This is probably the case for most databases and most queries. The
sort of queries I'm running are aggregate and join heavy, so:

DB[:table].select(:foo, :bar, :baz, 
:sum[:quux]).join(...).join(...).join(...).filter(...).group(:foo, :bar, :baz)

Datawarehousing/reporting style queries. This may be a MySQL-specific
problem, but the grouping and lack of a WHERE clause forces the query
to do full table scans building the whole resultset and then ordering,
even if you limit the result to 1 row. Filtering makes a huge
difference to this sort of query in MySQL (as you can see from the
difference between 2ms and 8s). I don't know if this sort of query
would have similar problems on other databases.

> I think Sequel's current default behavior is best.  We should probably
> modify the adapter you are using, though.

Fair enough. I hacked up a rough implementation this afternoon, which
solved my problems, but I haven't got it to the stage where I'm
confident it handles every edge case for general consumption. I can
polish this up and submit a patch for the MySQL adapter.

Cheers,
Roland

On Sep 16, 5:16 pm, Jeremy Evans <[email protected]> wrote:
> On Sep 16, 3:57 am, Roland Swingler <[email protected]> wrote:
>
> > Hi,
>
> > I've had a performance issue in one of my apps, and have traced it
> > back to Dataset#columns. This method seems to remove any filtering
> > from the dataset, and then query for the first row to get the column
> > symbols. This is problematic, because while my filtered query takes
> > about 2ms, my unfiltered query takes about 8sec, just to get the
> > column names.
>
> What database are you using?  It currently does an unfiltered,
> unordered, limited to 1 select, which should be fastest on most
> databases, which can choose any row in the table.  For certain
> databases that don't have a syntax for LIMIT and need to be emulated
> with subselects, it's possible performance is bad if they don't have a
> good query optimizer.  In those cases we can probably override
> Dataset#columns in the adapter.
>
> > So here are some thoughts as to how this could be improved, and I'm
> > really asking for problems with these methods that I haven't thought
> > of that would make #columns return incorrect results.
>
> > 1. If I've specified the columns I want returned using
> > select(:foo, :bar.as(:baz)), then do we not *always* already know what
> > the column names will be, (i.e. [:foo, :baz])? Even if I do an
> > unaliased calculation/sql-function usage like :sum[:foo], is the
> > column name not just the sql version of the calculation?
>
> For simple cases (all literal symbols), you might be able to.
> However, in most cases the names are database dependent.
>
> > 2. If I haven't specified the columns, or have a '*' somewhere, can we
> > not get the column names from the schema?
>
> Sequel actually does this for models in simple cases, but not for
> datasets.  I suppose it could do so for datasets as well.  In general
> the columns query is fast, so you don't need to.  And if you are going
> to be reusing the same dataset more than once, you should only be
> creating it once and using it multiple times, in which case you only
> have to pay the columns penalty once.
>
> > 3. If the above approaches have flaws, would it not be better to keep
> > the filtering on the dataset, as that will presumably be faster in
> > almost all cases?
>
> In most cases, an unfiltered, unordered, limited-to-one query will be
> fastest.  If you add a filter, the database has to check that filter
> (either via table or index scan).
>
> > I'm happy to create a patch that uses a combination of the above
> > approaches, as long as no-one can think of any showstoppers with them.
>
> I think Sequel's current default behavior is best.  We should probably
> modify the adapter you are using, though.
>
> Jeremy

-- 
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/sequel-talk?hl=en.

Re: Question about Dataset#columns implementation / performance

Reply via email to