Guys, we are not dealing with the same data volume as you are, at least not on a single-tenant basis, but we are very execution-time sensitive.
Can you please help me implement this join filter pushdown properly, to avoid these complete scans, by pointing me to the right examples so we can continue with our Lucene-based plan? We are more than happy to sponsor this work and/or pay for professional services to anyone who has the knowledge and the time to assist us.

Regards,
-Stefán

On Sun, Jan 17, 2016 at 7:50 PM, Stefán Baxter <[email protected]> wrote:

> Hi Rahul,
>
> I'm aware of the segment parallelization and the option of rewriting the queries, but I disagree with that being the best option.
>
> Since Drill supports pushdown of join filters, I think our best option is to implement that in the Lucene reader.
>
> Rewriting the queries may be a temporary option, but we are already using subqueries for more complex things and I really need these simple lookup joins to be both simple and effective.
>
> - Stefan
>
> On Sun, Jan 17, 2016 at 7:44 PM, rahul challapalli <[email protected]> wrote:
>
>> The level of parallelization in the lucene plugin is a segment.
>>
>> Stefan,
>>
>> I think it would be more accurate if you rewrite your join query so that we push the join keys into the lucene group scan and then compare the numbers. Something like the below:
>>
>> select * from tbl1 a left join (select * from tbl2 where tbl2.col1 in (select col1 from tbl1)) b where a.col1 = b.col1;
>>
>> - Rahul
>>
>> On Sun, Jan 17, 2016 at 11:20 AM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Can you give more detail about the join stats themselves? You also state 20x slower, but I'm trying to understand what that means. 20x slower than what? Are you parallelizing the Lucene read or is this a single reader?
>> >
>> > For example:
>> >
>> > I have a join.
>> > The left side has a billion rows.
>> > The right side has 10 million rows.
>> > When applying the join condition, only 10k rows are needed from the right side.
>> >
>> > How long does it take to read a few million records from Lucene? (Recently with Elastic we've been seeing ~50-100k/second per thread when only retrieving a single stored field.)
>> >
>> > --
>> > Jacques Nadeau
>> > CTO and Co-Founder, Dremio
>> >
>> > On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <[email protected]> wrote:
>> >
>> > > Hi Jacques,
>> > >
>> > > Thank you for taking the time, it's appreciated.
>> > >
>> > > I'm trying to contribute to the Lucene reader for Drill (started by Rahul Challapalli). We would like to use it for storage of metadata used in our Drill setup. This is perfectly suited for our needs as the metadata is already available in Lucene documents+indexes and it's tenant-specific (so this is not the global metadata that should reside in Postgres/HBase or something similar).
>> > >
>> > > I think it's best that I confess that I'm not sure what I'm looking for or how to ask for it, at least not in proper Drill terms.
>> > >
>> > > The Lucene reader is working, but the joins currently rely on a full scan, which introduces ~20 times longer execution time on simple data sets (a few million records), so I need to get the index-based joins going but I don't know how.
>> > > We have resources to do this now, but our knowledge of Drill is limited and I could not, in my initial scan of the project, find any use of DrillJoinRel that indicated indexes were involved (please forgive me if this is a false assumption).
>> > >
>> > > Can you please clarify things for me a bit:
>> > >
>> > > - Is the JDBC connector already doing proper pushdown of filters for joins? (If so, then I must really get my reading glasses on.)
>> > > - What will change with this new approach?
>> > >
>> > > I'm not really sure what you need from me now, but I'm more than happy to share everything except the data itself :).
>> > >
>> > > The fork is placed here: https://github.com/activitystream/drill/tree/lucene-work but no test files are included in the repo, sorry, and this is all very immature.
>> > >
>> > > Regards,
>> > > -Stefán
>> > >
>> > > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <[email protected]> wrote:
>> > >
>> > > > The closest things done to date are the join pushdown in the JDBC connector and the prototype code someone built a while back to do a join using HBase as a hash table. Aman and I have an ongoing thread discussing using Elastic indexing and sideband communication to accelerate joins. It would be great if you could cover exactly what you're doing (including relevant stats); that would give us a better idea of how to point you in the right direction.
>> > > >
>> > > > --
>> > > > Jacques Nadeau
>> > > > CTO and Co-Founder, Dremio
>> > > >
>> > > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <[email protected]> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > Can anyone point me to an implementation where joins are implemented with full support for filters and efficient handling of joins based on indexes?
>> > > > >
>> > > > > The only code I have come across seems to rely on a complete scan of the related table, and that is not acceptable for the use case we are working on (Lucene reader).
>> > > > >
>> > > > > Regards,
>> > > > > -Stefán
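For reference, the key-based lookup being discussed can be sketched in plain Lucene, independent of any Drill plumbing. This is only a sketch under assumptions: the pushed-down join keys arrive as a list of strings, "col1" stands in for the indexed join column, and Lucene 5.3+ is assumed for BooleanQuery.Builder; none of the names are taken from the actual lucene-work branch.

import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class JoinKeyLookupSketch {

  // Instead of scanning every document (the full scan the thread complains
  // about), OR the pushed-down join keys together so the inverted index
  // answers the lookup directly.
  public static void lookup(String indexPath, List<String> joinKeys) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
      IndexSearcher searcher = new IndexSearcher(reader);

      BooleanQuery.Builder builder = new BooleanQuery.Builder(); // Lucene 5.3+ API
      for (String key : joinKeys) {
        // "col1" is a placeholder for however the join column is indexed.
        builder.add(new TermQuery(new Term("col1", key)), BooleanClause.Occur.SHOULD);
      }

      // Capped for the sketch; a real reader would stream hits with a Collector
      // and batch the keys to stay under BooleanQuery's max clause count (1024).
      ScoreDoc[] hits = searcher.search(builder.build(), 10_000).scoreDocs;
      for (ScoreDoc hit : hits) {
        Document doc = searcher.doc(hit.doc);
        System.out.println(doc.get("col1")); // hand stored fields to the record batch here
      }
    }
  }
}

Rahul's suggested rewrite achieves much the same effect at the SQL level; the pushdown work discussed in the thread is essentially about having the planner hand that key list to the Lucene group scan automatically.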
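Rahul's remark that the unit of parallelization in the lucene plugin is a segment can likewise be illustrated with the leaf readers an index exposes. Again, this is a hedged sketch rather than plugin code; it only prints the information a per-segment split could be based on.

import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class SegmentSplitsSketch {

  // Each leaf reader corresponds to one Lucene segment; a group scan could
  // emit one split per leaf and assign each to its own minor fragment.
  public static void printSegments(String indexPath) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
      List<LeafReaderContext> leaves = reader.leaves();
      for (LeafReaderContext ctx : leaves) {
        System.out.println("segment ord=" + ctx.ord
            + " docBase=" + ctx.docBase
            + " docs=" + ctx.reader().numDocs());
      }
    }
  }
}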
