Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

Stefán Baxter Sat, 16 Jan 2016 12:35:19 -0800

Hi,

After a quick glance at DRILL-3929 I think I should state that this is
"only" about pushing down the join filter and an efficient way to do a
join that does not require a full scan.


We are not using Lucene as an external index for a separate data source,
i.e. the Lucene index contains all the information we need for the join
(stored fields).

I guess this would make more sense to people if we said we were using Solr
or Elasticsearch, but this use case is not as complex as the one detailed
in DRILL-3929.
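
To make the difference concrete, here is a minimal sketch in plain Python,
not Drill or Lucene code. All names in it (build_index, scan_join,
index_join, the entity_id field) are hypothetical and only for
illustration; the dictionary is a toy stand-in for a Lucene term index with
stored fields. It compares how many rows each strategy has to read from the
indexed side.

```python
from collections import defaultdict


def build_index(rows, key):
    """Toy stand-in for a term index: key value -> list of stored rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def scan_join(left, right, key):
    """Scan-based join: reads every row on the right side, matched or not."""
    build = defaultdict(list)
    for l in left:
        build[l[key]].append(l)
    out, rows_read = [], 0
    for r in right:  # full scan of the right side
        rows_read += 1
        for l in build.get(r[key], ()):
            out.append({**l, **r})
    return out, rows_read


def index_join(left, right_index, key):
    """Index-based join: the left side's keys are pushed down as lookups."""
    out, rows_read = [], 0
    for l in left:
        for r in right_index.get(l[key], ()):  # only matching rows are read
            rows_read += 1
            out.append({**l, **r})
    return out, rows_read


# Hypothetical data: a small fact side joined to a larger metadata side.
events = [{"entity_id": 2, "action": "view"}, {"entity_id": 5, "action": "buy"}]
metadata = [{"entity_id": i, "label": f"entity-{i}"} for i in range(1000)]
metadata_index = build_index(metadata, "entity_id")

scan_rows, scan_reads = scan_join(events, metadata, "entity_id")
index_rows, index_reads = index_join(events, metadata_index, "entity_id")
```

Both functions produce the same joined rows; the only difference is how
much of the metadata side has to be read to get there.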

Regards,
 -Stefan

On Sat, Jan 16, 2016 at 8:11 PM, Stefán Baxter <[email protected]>
wrote:

> Hi Jacques,
>
> Thank you for taking the time, it's appreciated.
>
> I'm trying to contribute to the Lucene reader for Drill (started by Rahul
> Challapalli). We would like to use it for storage of metadata used in our
> Drill setup.
> This is perfectly suited to our needs, as the metadata is already
> available in Lucene documents and indexes and is tenant-specific (so this is
> not the global metadata that should reside in Postgres/HBase or something
> similar).
>
> I think it's best that I confess that I'm not sure what I'm looking for or
> how to ask for it, at least not in proper Drill terms.
>
> The Lucene reader is working, but the joins currently rely on a full scan,
> which makes execution take ~20 times longer on simple data sets (a few
> million records). I need to get index-based joins going, but I don't
> know how.
>
> We have resources to do this now, but our knowledge of Drill is limited and
> I could not, in my initial scan of the project, find any use
> of DrillJoinRel that indicated indexes were involved (please forgive me if
> this is a false assumption).
>
> Can you please clarify things for me a bit:
>
>    - Is the JDBC connector already doing proper pushdown of filters for
>    joins? (If so then I must really get my reading glasses on)
>    - What will change with this new approach?
>
> I'm not really sure what you need from me now, but I'm more than happy to
> share everything except the data itself :).
>
> The fork is here:
> https://github.com/activitystream/drill/tree/lucene-work but no test
> files are included in the repo, sorry, and this is all very immature.
>
> Regards,
>  -Stefán
>
>
>
>
> On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <[email protected]>
> wrote:
>
>> The closest things done to date are the join pushdown in the JDBC
>> connector and the prototype code someone built a while back to do a join
>> using HBase as a hash table. Aman and I have an ongoing thread discussing
>> using elastic indexing and sideband communication to accelerate joins. It
>> would be great if you could cover exactly what you're doing (including
>> relevant stats); that would give us a better idea of how to point you in
>> the right direction.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > Can anyone point me to an implementation where joins are implemented with
>> > full support for filters and efficient handling of joins based on indexes?
>> >
>> > The only code I have come across seems to rely on a complete scan of the
>> > related table, and that is not acceptable for the use case we are working on
>> > (Lucene reader).
>> >
>> > Regards,
>> >  -Stefán
>> >
>>
>
>
