Just a plug here for a tool that started as a part of Drill and which might
help people build test cases.

Log-synth makes it very easy to build realistic data, flat or nested,
including fields such as street addresses, zip codes and SSNs.  It also
includes lots of tables of auxiliary data so that you can enrich these
fields as you produce them.

To use it, you create a schema in JSON that tells how to generate the data
you want and then run a simple command. You can generate 10 lines of data
or 10 billion, in CSV, TSV, or JSON format or by expanding templates.
Tables can link to each other cleanly.
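
As a rough illustration of what such a schema might look like, here is a
minimal Python sketch that builds one and serializes it to JSON. The field
names are made up for this example, and the "class" values (id, name,
address, zip, ssn) are assumed to match log-synth's sampler names; check
them against the README before relying on them.

```python
import json

# Hypothetical schema for illustration only. The field names are invented,
# and the "class" values are ASSUMED to correspond to log-synth samplers;
# verify them against the log-synth README.
schema = [
    {"name": "id", "class": "id"},
    {"name": "name", "class": "name"},
    {"name": "address", "class": "address"},
    {"name": "zip", "class": "zip"},
    {"name": "ssn", "class": "ssn"},
]


def schema_json(fields):
    """Serialize a list of field specs to a JSON string."""
    return json.dumps(fields, indent=2)


if __name__ == "__main__":
    # Write the schema to a file that can then be handed to log-synth.
    with open("schema.json", "w") as f:
        f.write(schema_json(schema))
```

The resulting schema.json would then be passed to the log-synth command
described in the README, along with the number of records you want.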

Creating test data where you can't share the original is fairly easy using
log-synth.

See https://github.com/tdunning/log-synth for details.



On Mon, Aug 24, 2015 at 5:20 PM, Aman Sinha <[email protected]> wrote:

> I was about to say that for IN lists of size 20 or more, Drill uses a more
> efficient Values operator instead of OR conditions but then realized the OR
> filter is referencing 4 different columns : $1..$4 and each of those
> individual lists is less than 20.  Sungwook, can you please provide the
> SQL query and any view definitions or anything that goes with it?  It is
> difficult to figure out things without the full picture.
> thanks,
> Aman
>
> On Mon, Aug 24, 2015 at 5:10 PM, Ted Dunning <[email protected]> wrote:
>
> > On Mon, Aug 24, 2015 at 4:50 PM, Sungwook Yoon <[email protected]> wrote:
> >
> > > Still, the performance drop down due to OR filtering is just
> > > astounding...
> > >
> >
> > That is what query optimizers are for and why getting them to work well is
> > important.
> >
> > The difference in performance that you are observing is not surprising
> > given the redundant work that you are seeing. Using the OR operator
> > prevents any significant short-circuiting and the repeated conversion
> > operations that are happening make the evaluation much more expensive than
> > it would otherwise be (a dozen extra copies where only one is needed).
> >
> > Other queries that can be subject to similar problems include common table
> > expressions that read the same (large) input file many times.  So far,
> > Drill doesn't optimize all such expressions well.
> >
>
