Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

Stefán Baxter Sat, 16 Jan 2016 12:12:06 -0800

Hi Jacques,

Thank you for taking the time, it's appreciated.

I'm trying to contribute to the Lucene reader for Drill (started by Rahul
Challapalli). We would like to use it to store the metadata used in our
Drill setup.
It is well suited to our needs, as the metadata is already available as
Lucene documents and indexes and it is tenant-specific (so this is not the
global metadata that should reside in Postgres/HBase or something similar).

I think it's best that I confess that I'm not sure what I'm looking for or
how to ask for it, at least not in proper Drill terms.

The Lucene reader is working, but the joins currently rely on a full scan,
which makes execution roughly 20 times slower on simple data sets (a few
million records). I need to get index-based joins going, but I don't
know how.

We have the resources to do this now, but our knowledge of Drill is limited,
and in my initial scan of the project I could not find any use
of DrillJoinRel that indicated indexes were involved (please forgive me if
this is a false assumption).

Can you please clarify a few things for me:

   - Is the JDBC connector already doing proper pushdown of filters for
   joins? (If so, then I must really get my reading glasses on.)
   - What will change with this new approach?

I'm not really sure what you need from me now, but I'm more than happy to
share everything except the data itself :).

The fork is located here:
https://github.com/activitystream/drill/tree/lucene-work but no test files
are included in the repo, sorry, and this is all very immature.

Regards,
 -Stefán

On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <[email protected]> wrote:

> The closest things done to date are the join pushdown in the JDBC
> connector and the prototype code someone built a while back to do a join
> using HBase as a hash table. Aman and I have an ongoing thread discussing
> using elastic indexing and sideband communication to accelerate joins. It
> would be great if you could cover exactly what you're doing (including
> relevant stats); that would give us a better idea of how to point you in
> the right direction.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > Can anyone point me to an implementation where joins are implemented with
> > full support for filters and efficient handling of joins based on
> > indexes?
> >
> > The only code I have come across all seems to rely on a complete scan of
> > the related table, and that is not acceptable for the use case we are
> > working on (Lucene reader).
> >
> > Regards,
> >  -Stefán
> >
>
