Aman,

Sorry for slow reply.

Here is the SQL query.

SELECT count(1) FROM `o` a WHERE (a.DX1 IN ('39891', '4280', '4281',
'42820', '42821', '42822', '42823', '42830', '42831', '42832', '42833',
'42840', '42841', '42842', '42843', '4289') OR a.DX2 IN ('39891', '4280',
'4281', '42820', '42821', '42822', '42823', '42830', '42831', '42832',
'42833', '42840', '42841', '42842', '42843', '4289') OR a.DX3 IN ('39891',
'4280', '4281', '42820', '42821', '42822', '42823', '42830', '42831',
'42832', '42833', '42840', '42841', '42842', '42843', '4289') OR a.DX4 IN
('39891', '4280', '4281', '42820', '42821', '42822', '42823', '42830',
'42831', '42832', '42833', '42840', '42841', '42842', '42843', '4289') OR
a.DX5 IN ('39891', '4280', '4281', '42820', '42821', '42822', '42823',
'42830', '42831', '42832', '42833', '42840', '42841', '42842', '42843',
'4289')) AND a.`year` BETWEEN 2009 AND 2013;

For the view definition, I tried to convert_from to UTF and cast as Varchar.
For integer type, I casted.
All resulted in similar slow performance.

Let me know if you have any insights.

Thanks,

Sungwook


On Mon, Aug 24, 2015 at 7:20 PM, Aman Sinha <[email protected]> wrote:

> I was about to say that for IN lists of size 20 or more, Drill uses a more
> efficient Values operator instead of OR conditions but then realized the OR
> filter is referencing 4 different columns : $1..$4 and each of those
> individual lists is less than 20.  Sungwook,  can you please provide the
> SQL query and any view definitions or anything that goes with it ?  It is
> difficult to figure out things without the full picture.
> thanks,
> Aman
>
> On Mon, Aug 24, 2015 at 5:10 PM, Ted Dunning <[email protected]>
> wrote:
>
> > On Mon, Aug 24, 2015 at 4:50 PM, Sungwook Yoon <[email protected]>
> wrote:
> >
> > > Still, the performance drop down due to OR filtering is just
> > astounding...
> > >
> >
> > That is what query optimizers are for and why getting them to work well
> is
> > important.
> >
> > The difference in performance that you are observing is not surprising
> > given the redundant work that you are seeing. Using the OR operator
> > prevents any significant short-circuiting and the repeated conversion
> > operations that are happening make the evaluation much more expensive
> than
> > it would otherwise be (a dozen extra copies where only one is needed).
> >
> > Other queries that can be subject to similar problems include common
> table
> > expressions that read the same (large) input file many times.  So far,
> > Drill doesn't optimize all such expressions well.
> >
>

Reply via email to