Good plug for log-synth.  I'm still hoping for a storage plugin... (I
suppose I have to write some docs on the API first :)

I think the issue in many cases is not generating the right data.  More
often, it is generating the right query.  The easiest way to solve these
issues is to continue to get real-world example queries like the one
Sungwook provided above.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Aug 24, 2015 at 6:29 PM, Ted Dunning <[email protected]> wrote:

> Just a plug here for a tool that started as a part of Drill and which might
> help people build test cases.
>
> Log-synth makes it very easy to build realistic data, flat or nested. It
> creates realistic data such as street addresses, zip codes and SSNs.  It
> also includes lots of tables of auxiliary data so that you can enrich these
> fields as you produce them.
>
> To use it, you create a schema in JSON that tells how to generate the data
> you want and then run a simple command. You can generate 10 lines of data
> or 10 billion in CSV, TSV, or JSON formats or by expanding templates.  Data
> can be flat or complex. Tables can link to each other cleanly.
>
> Creating test data where you can't share the original is fairly easy using
> log-synth.
>
> See https://github.com/tdunning/log-synth for details.
>
>
>
> On Mon, Aug 24, 2015 at 5:20 PM, Aman Sinha <[email protected]> wrote:
>
> > I was about to say that for IN lists of size 20 or more, Drill uses a more
> > efficient Values operator instead of OR conditions, but then realized the OR
> > filter is referencing 4 different columns: $1..$4, and each of those
> > individual lists is less than 20.  Sungwook, can you please provide the
> > SQL query and any view definitions or anything that goes with it?  It is
> > difficult to figure out things without the full picture.
> > thanks,
> > Aman
> >
> > On Mon, Aug 24, 2015 at 5:10 PM, Ted Dunning <[email protected]> wrote:
> >
> > > On Mon, Aug 24, 2015 at 4:50 PM, Sungwook Yoon <[email protected]> wrote:
> > >
> > > > Still, the performance drop due to OR filtering is just astounding...
> > > >
> > >
> > > That is what query optimizers are for and why getting them to work well
> > > is important.
> > >
> > > The difference in performance that you are observing is not surprising
> > > given the redundant work that you are seeing. Using the OR operator
> > > prevents any significant short-circuiting, and the repeated conversion
> > > operations that are happening make the evaluation much more expensive
> > > than it would otherwise be (a dozen extra copies where only one is needed).
> > >
> > > Other queries that can be subject to similar problems include common
> > > table expressions that read the same (large) input file many times.  So far,
> > > Drill doesn't optimize all such expressions well.
> > >
> >
>

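For readers following the thread, the point about OR filters repeating
conversion work can be sketched in a few lines of Python. This is only an
illustration and not Drill code: `convert`, `or_filter`, and `set_filter` are
hypothetical names, and the row values are made up. It counts how many times
a conversion runs when every OR branch converts the value again, compared
with converting once and doing a single membership test.

```python
# Illustrative sketch only; this is not how Drill is implemented.
# `convert` stands in for a cast or conversion applied to a column value.

calls = {"convert": 0}


def convert(raw):
    """Hypothetical conversion step; counts how often it is invoked."""
    calls["convert"] += 1
    return int(raw)


def or_filter(raw, wanted):
    # One comparison per listed value. Every branch is evaluated and each
    # branch repeats the conversion, so nothing is short-circuited.
    results = [convert(raw) == w for w in wanted]
    return any(results)


def set_filter(raw, wanted_set):
    # Convert once, then do a single hash lookup.
    return convert(raw) in wanted_set


rows = ["3", "17", "42", "8"]   # made-up input values
wanted = list(range(12))        # a dozen literal values in the filter
wanted_set = set(wanted)

calls["convert"] = 0
or_hits = [r for r in rows if or_filter(r, wanted)]
or_calls = calls["convert"]

calls["convert"] = 0
set_hits = [r for r in rows if set_filter(r, wanted_set)]
set_calls = calls["convert"]

print("OR form: ", or_hits, "conversions:", or_calls)
print("Set form:", set_hits, "conversions:", set_calls)
```

Both forms select the same rows, but with four rows and a dozen values the
OR form runs the conversion 48 times while the set form runs it 4 times.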