I'm currently working on a patch using the idea described in DRILL-5546.
The idea is similar to your idea of a null row: instead of returning an
empty batch, or a scan batch with injected nullable-int columns, we will
return NONE to the downstream operators directly, which will avoid the
unintended consequences.
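
As a rough, standalone sketch of that idea (not the actual patch; the class,
method, and field names below are made up for illustration):

  public class NoneOnEmptyScanSketch {

    // Mirrors the shape of Drill's RecordBatch.IterOutcome, for illustration only.
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    interface RowReader {
      int next();          // rows read into the current batch
      boolean hasSchema(); // whether the reader ever discovered a schema
    }

    // If the pushed-down predicate pruned everything, the reader has no rows
    // and no schema to offer. Report NONE instead of building an empty batch
    // with injected nullable-int columns that downstream operators could
    // misinterpret.
    static IterOutcome nextBatch(RowReader reader, boolean schemaSentBefore) {
      int rows = reader.next();
      if (rows == 0 && !reader.hasSchema() && !schemaSentBefore) {
        return IterOutcome.NONE;
      }
      return schemaSentBefore ? IterOutcome.OK : IterOutcome.OK_NEW_SCHEMA;
    }
  }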

I will probably wrap up that work in a few days, and submit a PR for
review.



On Mon, Jul 24, 2017 at 5:37 PM, Cliff Resnick <[email protected]> wrote:

> That makes sense, so I guess the solution is to return a null row instead?
> If so, is there a way to flag it to be ignored downstream (to avoid any
> unintended consequences)?
>
> Thanks for the help!
>
> On Mon, Jul 24, 2017 at 7:06 PM, Jinfeng Ni <[email protected]> wrote:
>
> > Based on my limited understanding of Drill's KuduRecordReader, the problem
> > seems to be in the next() method [1]. When the RowResult iterator's hasNext()
> > returns false, in the case where the filter prunes everything, the code skips
> > the call to addRowResult(). That means no columns/data will be added to the
> > scan's batch, and a nullable int will be injected in a downstream operator.
> >
> > 1. https://github.com/apache/drill/blob/master/contrib/storage-kudu/src/main/java/org/apache/drill/exec/store/kudu/KuduRecordReader.java#L149-L163
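> >
> > In paraphrased form (not the exact code at the link above), the pattern I'm
> > describing looks roughly like this; addRowResult() here stands in for the
> > real method that copies Kudu columns into Drill's value vectors:
> >
> >   import java.util.Iterator;
> >
> >   class PrunedScanSketch {
> >     // If the pushed-down predicate pruned everything, hasNext() is false on
> >     // the very first call, so addRowResult() is never reached and the scan
> >     // batch ends up with zero columns as well as zero rows.
> >     static <T> int next(Iterator<T> results, int batchSize) {
> >       int rowCount = 0;
> >       while (results.hasNext() && rowCount < batchSize) {
> >         addRowResult(results.next(), rowCount);
> >         rowCount++;
> >       }
> >       return rowCount;
> >     }
> >
> >     static <T> void addRowResult(T row, int index) {
> >       // stand-in: copy the row's columns into the batch's vectors
> >     }
> >   }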
> >
> >
> > On Mon, Jul 24, 2017 at 1:35 PM, Cliff Resnick <[email protected]> wrote:
> >
> > > Jinfeng,
> > >
> > > I'm wondering if there's a way to push schema info to Drill even if there
> > > is no result. KuduScanner always has a schema, and the RecordReader always
> > > has a scanner. But I can't seem to find the disconnect. Any idea if this is
> > > possible, even if it's a Kudu-specific hack?
> > >
> > > -Cliff
> > >
> > > On Mon, Jul 24, 2017 at 2:46 PM, Cliff Resnick <[email protected]> wrote:
> > >
> > >> Jinfeng,
> > >>
> > >> Thanks, that confirms my thoughts as well. If I query using full range
> > >> bounds and all hash keys, then Kudu prunes to the exact tablets and there
> > >> is no error. I'll watch that JIRA expectantly, because Kudu + Drill would
> > >> be an awesome combo. But without the pruning it's useless to us.
> > >>
> > >> -Cliff
> > >>
> > >> On Mon, Jul 24, 2017 at 2:17 PM, Jinfeng Ni <[email protected]> wrote:
> > >>
> > >>> If you see such errors only when you enable predicate pushdown, it might
> > >>> be related to a known issue: a schema change failure caused by an empty
> > >>> batch [1]. This happens when the predicate prunes everything and the Kudu
> > >>> reader does not return a RowResult with a schema. In that case, Drill
> > >>> interprets the requested column (such as a) as nullable int, which
> > >>> conflicts with other minor fragments that may have the data/schema.
> > >>>
> > >>> The reason you hit such failures randomly is that there is a race
> > >>> condition for the conflict to happen. If the minor fragment with the empty
> > >>> batch is executed after the one with data, the empty batch is ignored. In
> > >>> the reverse order, it causes a conflict, and hence the query failure.
> > >>>
> > >>> 1. https://issues.apache.org/jira/browse/DRILL-5546
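> > >>>
> > >>> A toy, self-contained sketch of why arrival order matters (illustrative
> > >>> only, not Drill's actual schema handling; the type and class names are
> > >>> made up):
> > >>>
> > >>>   import java.util.List;
> > >>>
> > >>>   class SchemaRaceSketch {
> > >>>     enum ColumnType { NULLABLE_INT, BIGINT }
> > >>>
> > >>>     record Batch(ColumnType type, int rowCount) {}
> > >>>
> > >>>     // The first batch seen fixes the schema, even if it carries zero rows.
> > >>>     // An empty batch that arrives after a data batch is simply skipped;
> > >>>     // an empty nullable-int batch that arrives first makes the later
> > >>>     // BIGINT data batch look like a schema change.
> > >>>     static ColumnType resolve(List<Batch> arrivalOrder) {
> > >>>       ColumnType schema = null;
> > >>>       for (Batch b : arrivalOrder) {
> > >>>         if (schema == null) {
> > >>>           schema = b.type();
> > >>>         } else if (b.type() != schema && b.rowCount() == 0) {
> > >>>           continue;  // empty batch after data: ignored
> > >>>         } else if (b.type() != schema) {
> > >>>           throw new IllegalStateException(
> > >>>               "changing schema: " + schema + " vs " + b.type());
> > >>>         }
> > >>>       }
> > >>>       return schema;
> > >>>     }
> > >>>   }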
> > >>>
> > >>>
> > >>>
> > >>> On Mon, Jul 24, 2017 at 10:56 AM, Cliff Resnick <[email protected]> wrote:
> > >>>
> > >>> > I spent some time over the weekend altering Drill's storage-kudu to use
> > >>> > Kudu's predicate pushdown API. Everything worked great as long as I
> > >>> > performed flat filtered selects (e.g. SELECT ... FROM ... WHERE ...), but
> > >>> > whenever I tested aggregate queries, they would succeed sometimes, then
> > >>> > fail other times -- using the exact same queries.
> > >>> >
> > >>> > The failures were always like below. After searching around, I came
> > >>> > across a number of JIRAs, like
> > >>> > https://issues.apache.org/jira/browse/DRILL-2602 that imply Drill can't
> > >>> > handle sorts/aggregate queries on "changing schemas". This was confusing
> > >>> > to me because I was testing with a single table/single schema, which
> > >>> > leaves me wondering if "changing schema" means the unknown type of the
> > >>> > aggregate itself? Meaning, with SELECT SUM(a), b FROM t GROUP BY a, where
> > >>> > field a is an INT64, Drill can't figure out how to deal with SUM(a)
> > >>> > because it may exceed the scale of INT64?
> > >>> >
> > >>> > If someone could clarify this for me, I'd really appreciate it. I'm
> > >>> > really hoping my understanding above is not correct and it's just a
> > >>> > problem with the Vector handling in storage-kudu, because otherwise it
> > >>> > seems that Drill's aggregation capabilities are rather limited.
> > >>> >
> > >>> > Errors:
> > >>> >
> > >>> > java.lang.IllegalStateException: Failure while reading vector. Expected vector class of org.apache.drill.exec.vector.NullableIntVector but was holding vector class org.apache.drill.exec.vector.BigIntVector, field= campaign_id(BIGINT:REQUIRED)
> > >>> >   at org.apache.drill.exec.record.VectorContainer.getValueAccessorById(VectorContainer.java:321)
> > >>> >   at org.apache.drill.exec.record.RecordBatchLoader.getValueAccessorById(RecordBatchLoader.java:179)
> > >>> >
> > >>> > OR
> > >>> >
> > >>> > Error: UNSUPPORTED_OPERATION ERROR: Sort doesn't currently support sorts with changing schemas.
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
