Awesome! I'll be watching that issue for the PR.

On Tue, Jul 25, 2017 at 2:50 PM, Jinfeng Ni <[email protected]> wrote:

I'm currently working on a patch using the idea described in DRILL-5546.
The idea is similar to your idea of a null row: instead of returning an
empty batch, or a scan batch with injected nullable-int columns, we will
return NONE to the downstream operators directly, which will avoid the
unintended consequence.

I will probably wrap up that work in a few days, and submit a PR for review.

On Mon, Jul 24, 2017 at 5:37 PM, Cliff Resnick <[email protected]> wrote:

That makes sense, so I guess the solution is to return a null row instead?
If so, is there a way to flag it to be ignored downstream (to avoid any
unintended consequences)?

Thanks for the help!

On Mon, Jul 24, 2017 at 7:06 PM, Jinfeng Ni <[email protected]> wrote:

Based on my limited understanding of Drill's KuduRecordReader, the problem
seems to be in the next() method [1]. When RowResult's iterator returns
false for hasNext(), in the case where the filter prunes everything, the
code will skip the call to addRowResult(). That means no columns/data will
be added to the scan's batch. Nullable int will be injected in the
downstream operator.

1. https://github.com/apache/drill/blob/master/contrib/storage-kudu/src/main/java/org/apache/drill/exec/store/kudu/KuduRecordReader.java#L149-L163

On Mon, Jul 24, 2017 at 1:35 PM, Cliff Resnick <[email protected]> wrote:

Jinfeng,

I'm wondering if there's a way to push schema info to Drill even if there
is no result. KuduScanner always has a schema, and RecordReader always has
a scanner. But I can't seem to find the disconnect. Any idea if this is
possible, even if it's a Kudu-specific hack?

-Cliff

On Mon, Jul 24, 2017 at 2:46 PM, Cliff Resnick <[email protected]> wrote:

Jinfeng,

Thanks, that confirms my thoughts as well.
If I query using full range bounds and all hash keys, then Kudu prunes to
the exact tablets and there is no error. I'll watch that jira expectantly,
because Kudu + Drill would be an awesome combo. But without the pruning
it's useless to us.

-Cliff

On Mon, Jul 24, 2017 at 2:17 PM, Jinfeng Ni <[email protected]> wrote:

If you see such errors only when you enable predicate pushdown, it might be
related to a known issue: schema change failure caused by an empty batch
[1]. This happens when the predicate prunes everything and the Kudu reader
does not return a RowResult with a schema. In such a case, Drill would
interpret the requested column (such as a) as nullable int, which would
conflict with other minor fragments that may have the data/schema.

The reason why you hit such failures randomly: there is a race condition
for this conflict to happen. If the minor fragment with the empty batch is
executed after the one with data, the empty batch will be ignored. In the
reverse order, it causes a conflict, hence query failure.

1. https://issues.apache.org/jira/browse/DRILL-5546

On Mon, Jul 24, 2017 at 10:56 AM, Cliff Resnick <[email protected]> wrote:

I spent some time over the weekend altering Drill's storage-kudu to use
Kudu's predicate pushdown API. Everything worked great as long as I
performed flat filtered selects (e.g. SELECT .. FROM .. WHERE ..), but
whenever I tested aggregate queries, they would succeed sometimes, then
fail other times -- using the exact same queries.

The failures were always like below.
After searching around, I came across a number of jiras, like
https://issues.apache.org/jira/browse/DRILL-2602, that imply Drill can't
handle sorts/aggregate queries on "changing schemas". This was confusing to
me because I was testing with a single table/single schema, which leaves me
wondering if "changing schema" means the unknown type of the aggregate
itself? Meaning, for SELECT SUM(a),b FROM t GROUP BY a, where field a is an
INT64, Drill can't figure out how to deal with SUM(a) because it may exceed
the scale of INT64?

If someone could clarify this for me I'd really appreciate it. I'm really
hoping my above understanding is not correct and it's just a problem with
the Vector handling in storage-kudu, because otherwise it seems that
Drill's aggregation capabilities are rather limited.

Errors:

java.lang.IllegalStateException: Failure while reading vector. Expected
vector class of org.apache.drill.exec.vector.NullableIntVector but was
holding vector class org.apache.drill.exec.vector.BigIntVector,
field=campaign_id(BIGINT:REQUIRED)
  at org.apache.drill.exec.record.VectorContainer.getValueAccessorById(VectorContainer.java:321)
  at org.apache.drill.exec.record.RecordBatchLoader.getValueAccessorById(RecordBatchLoader.java:179)

OR

Error: UNSUPPORTED_OPERATION ERROR: Sort doesn't currently support sorts
with changing schemas.
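The race condition described in the thread can be illustrated with a small standalone sketch. Everything here is hypothetical (the class, the Batch record, and the merge rule are invented for illustration, not Drill's actual code), assuming a downstream operator that fixes its schema from the first batch it receives and tolerates a mismatched schema only on an empty batch:

```java
// Hypothetical, minimal model of the empty-batch schema race (not Drill code).
public class SchemaRace {
    // Each "batch" carries the column type its scan fragment inferred.
    record Batch(String columnType, int rowCount) {}

    // Downstream operator: schema is fixed by the first arriving batch.
    // A later batch with a different schema is ignored only if it is empty.
    static String receive(Batch first, Batch second) {
        String schema = first.columnType();
        if (!second.columnType().equals(schema)) {
            if (second.rowCount() == 0) {
                return "OK: empty batch ignored, schema=" + schema;
            }
            return "FAIL: schema changed from " + schema + " to " + second.columnType();
        }
        return "OK: schema=" + schema;
    }

    public static void main(String[] args) {
        Batch data  = new Batch("BIGINT:REQUIRED", 100); // fragment that read rows
        Batch empty = new Batch("INT:OPTIONAL", 0);      // pruned fragment with injected nullable int
        System.out.println(receive(data, empty)); // data arrives first: empty batch ignored
        System.out.println(receive(empty, data)); // empty arrives first: schema conflict
    }
}
```

With the data-bearing fragment arriving first, the empty batch is harmless; in the reverse order, the nullable-int guess becomes the fixed schema and the real BIGINT batch looks like a schema change, which matches the intermittent failures reported above.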
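The DRILL-5546 approach Jinfeng describes (returning NONE rather than an empty batch with injected nullable-int columns) can likewise be sketched abstractly. The enum and method below are hypothetical simplifications loosely modeled on an iterator-outcome protocol, not Drill's actual RecordBatch API:

```java
// Hypothetical sketch of the DRILL-5546 idea: a scan that matched no rows
// reports NONE instead of emitting an empty batch with invented columns.
public class PrunedScan {
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    // Stand-in for a scan whose pushed-down predicate may prune every row.
    static IterOutcome next(int rowsRead, boolean firstBatch) {
        if (rowsRead == 0) {
            // Nothing matched: terminate immediately. Downstream never sees a
            // guessed nullable-int schema, so fragments with real data cannot conflict.
            return IterOutcome.NONE;
        }
        return firstBatch ? IterOutcome.OK_NEW_SCHEMA : IterOutcome.OK;
    }

    public static void main(String[] args) {
        System.out.println(next(0, true));  // fully pruned scan -> NONE
        System.out.println(next(42, true)); // first batch with data -> OK_NEW_SCHEMA
    }
}
```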
