Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

rahul challapalli Sun, 17 Jan 2016 11:45:57 -0800

The unit of parallelization in the Lucene plugin is a segment.

Stefan,


I think the comparison would be more accurate if you rewrite your join query
so that we push the join keys into the Lucene group scan, and then compare
the numbers. Something like the below:

   select * from tbl1 a left join (select * from tbl2 where tbl2.col1 in
   (select col1 from tbl1)) b on a.col1 = b.col1;
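
To illustrate why pushing the keys down should change the numbers, here is a
minimal Python sketch. It is not Drill or Lucene plugin code: the dict built
by build_index only stands in for a Lucene index on col1, and scan_join,
build_index, pushdown_join and the rows_read counter are hypothetical names
made up for this illustration.

```python
from collections import defaultdict


def scan_join(left, right, key):
    """Left join that reads every right-side row (scan-based join)."""
    by_key = defaultdict(list)
    rows_read = 0
    for row in right:  # full scan of the right side
        rows_read += 1
        by_key[row[key]].append(row)
    # Unmatched left rows are paired with None, as in a left join.
    out = [(l, r) for l in left for r in (by_key.get(l[key]) or [None])]
    return out, rows_read


def build_index(rows, key):
    """Stand-in for an index on `key` (assumption: built ahead of time)."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def pushdown_join(left, right_index, key):
    """Left join that fetches only right-side rows matching the left keys."""
    keys = {l[key] for l in left}  # join keys pushed down to the right side
    by_key = {}
    rows_read = 0
    for k in keys:
        hits = right_index.get(k, [])  # index lookup instead of a scan
        rows_read += len(hits)
        by_key[k] = hits
    out = [(l, r) for l in left for r in (by_key[l[key]] or [None])]
    return out, rows_read


tbl1 = [{"col1": 1}, {"col1": 2}, {"col1": 5000}]
tbl2 = [{"col1": i, "b": i * 10} for i in range(1000)]

scan_out, scan_read = scan_join(tbl1, tbl2, "col1")
push_out, push_read = pushdown_join(tbl1, build_index(tbl2, "col1"), "col1")
print(scan_out == push_out, scan_read, push_read)
```

Both functions return the same joined rows; the only difference is how many
right-side rows have to be read, which is what the rewritten query is trying
to measure.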

- Rahul

On Sun, Jan 17, 2016 at 11:20 AM, Jacques Nadeau <[email protected]> wrote:

> Can you give more detail about the join stats themselves? You also state
> 20x slower but I'm trying to understand what that means. 20x slower than
> what? Are you parallelizing the Lucene read or is this a single reader?
>
> For example:
>
> I have a join.
> The left side has a billion rows.
> The right side has 10 million rows.
> When applying the join condition, only 10k rows are needed from the right
> side.
>
> How long does it take to read a few million records from Lucene? (Recently
> with Elastic we've been seeing ~50-100k/second per thread when only
> retrieving a single stored field.)
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi Jacques,
> >
> > Thank you for taking the time, it's appreciated.
> >
> > I'm trying to contribute to the Lucene reader for Drill (Started by Rahul
> > Challapalli). We would like to use it for storage of metadata used in our
> > Drill setup.
> > This is perfectly suited for our needs as the metadata is already
> > available in Lucene document+indexes and it's tenant-specific (so this
> > is not the global metadata that should reside in Postgres/HBase or
> > something similar).
> >
> > I think it's best that I confess that I'm not sure what I'm looking for
> > or how to ask for it, at least not in proper Drill terms.
> >
> > The Lucene reader is working, but the joins currently rely on a full
> > scan, which results in ~20 times longer execution time on simple data
> > sets (a few million records). I need to get index-based joins going, but
> > I don't know how.
> >
> > We have resources to do this now, but our knowledge of Drill is limited
> > and I could not, in my initial scan of the project, find any use of
> > DrillJoinRel that indicated indexes were involved (please forgive me if
> > this is a false assumption).
> >
> > Can you please clarify things for me a bit:
> >
> >    - Is the JDBC connector already doing proper pushdown of filters for
> >    joins? (If so then I must really get my reading glasses on)
> >    - What will change with this new approach?
> >
> > I'm not really sure what you need from me now but I'm more than happy to
> > share everything except the data itself :).
> >
> > The fork is located here:
> > https://github.com/activitystream/drill/tree/lucene-work but no test
> > files are included in the repo, sorry, and this is all very immature.
> >
> > Regards,
> >  -Stefán
> >
> >
> >
> >
> > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > The closest things done to date are the join pushdown in the JDBC
> > > connector and the prototype code someone built a while back to do a
> > > join using HBase as a hash table. Aman and I have an ongoing thread
> > > discussing using elastic indexing and sideband communication to
> > > accelerate joins. It would be great if you could cover exactly what
> > > you're doing (including relevant stats); that would give us a better
> > > idea of how to point you in the right direction.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter
> > > <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Can anyone point me to an implementation where joins are implemented
> > > > with full support for filters and efficient handling of joins based
> > > > on indexes?
> > > >
> > > > The only code I have come across seems to rely on a complete scan of
> > > > the related table, and that is not acceptable for the use case we are
> > > > working on (Lucene reader).
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > >
> >
>
