Re: Question about Dataset#columns implementation / performance

Jeremy Evans Thu, 16 Sep 2010 09:16:26 -0700

On Sep 16, 3:57 am, Roland Swingler <[email protected]> wrote:
> Hi,
>
> I've had a performance issue in one of my apps, and have traced it
> back to Dataset#columns. This method seems to remove any filtering
> from the dataset, and then query for the first row to get the column
> symbols. This is problematic, because while my filtered query takes
> about 2ms, my unfiltered query takes about 8sec, just to get the
> column names.


What database are you using?  It currently does an unfiltered,
unordered, limited to 1 select, which should be fastest on most
databases, which can choose any row in the table.  For certain
databases that don't have a syntax for LIMIT and need to be emulated
with subselects, it's possible performance is bad if they don't have a
good query optimizer.  In those cases we can probably override
Dataset#columns in the adapter.

> So here are some thoughts as to how this could be improved, and I'm
> really asking for problems with these methods that I haven't thought
> of that would make #columns return incorrect results.
>
> 1. If I've specified the columns I want returned using
> select(:foo, :bar.as(:baz)), then do we not *always* already know what
> the column names will be, (i.e. [:foo, :baz])? Even if I do an
> unaliased calculation/sql-function usage like :sum[:foo], is the
> column name not just the sql version of the calculation?

For simple cases (all literal symbols), you might be able to.
However, in most cases the names are database dependent.

> 2. If I haven't specified the columns, or have a '*' somewhere, can we
> not get the column names from the schema?

Sequel actually does this for models in simple cases, but not for
datasets.  I suppose it could do so for datasets as well.  In general
the columns query is fast, so you don't need to.  And if you are going
to be reusing the same dataset more than once, you should only be
creating it once and using it multiple times, in which case you only
have to pay the columns penalty once.

> 3. If the above approaches have flaws, would it not be better to keep
> the filtering on the dataset, as that will presumably be faster in
> almost all cases?

In most cases, an unfiltered, unordered, limited-to-one query will be
fastest.  If you add a filter, the database has to check that filter
(either via table or index scan).

> I'm happy to create a patch that uses a combination of the above
> approaches, as long as no-one can think of any showstoppers with them.

I think Sequel's current default behavior is best.  We should probably
modify the adapter you are using, though.

Jeremy

-- 
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/sequel-talk?hl=en.

Re: Question about Dataset#columns implementation / performance

Reply via email to