[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429635#comment-16429635
 ] 

Paul Rogers commented on DRILL-6312:


While we are focusing on the types of pesky fields, data processing systems 
often allow other forms of column definitions.

For example, it is often helpful to combine or split columns. Suppose I have a 
field like the following from a web log:

{noformat}
GET http://mySite.com/path/to/asset
{noformat}

I may want to split this into four fields: HTTP operation ("GET"), service type 
("http"), host ("mySite.com") and asset ("/path/to/asset").
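
As a rough illustration of the split (a sketch only: Python standing in for whatever a Drill UDF would do, with the hypothetical name {{split_request}}):

```python
from urllib.parse import urlsplit


def split_request(field: str) -> dict:
    """Split a log field such as 'GET http://host/path' into four parts."""
    # The operation and the URL are separated by the first space.
    operation, url = field.split(" ", 1)
    parts = urlsplit(url)
    return {
        "operation": operation,    # HTTP operation
        "service": parts.scheme,   # service type
        "host": parts.netloc,      # host
        "asset": parts.path,       # asset path
    }


print(split_request("GET http://mySite.com/path/to/asset"))
```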

Or, I may have two fields that give the date and time:

{noformat}
2018-04-07, 10:13:43.345
{noformat}

And I may want to combine them into a single date-time type.
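
A minimal sketch of the combine step, assuming the two fields arrive as strings in the format shown above (the function name {{combine_date_time}} is made up for illustration):

```python
from datetime import datetime


def combine_date_time(date_field: str, time_field: str) -> datetime:
    """Combine separate date and time strings into one datetime value."""
    text = f"{date_field.strip()} {time_field.strip()}"
    # %f parses the fractional seconds (milliseconds in the sample data).
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


print(combine_date_time("2018-04-07", "10:13:43.345").isoformat())
```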

A handy technique is to define a computed column that does the work. If the 
computed column can call a UDF, then pretty much any transform is possible. 
Here is a very simple case for a line item:

{noformat}
price * quantity AS extendedPrice
{noformat}
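
To make the computed-column idea concrete, here is a small sketch in Python rather than Drill SQL. The helper {{add_computed_columns}} and the representation of a row as a dict are assumptions for illustration only:

```python
def add_computed_columns(row: dict, computed: dict) -> dict:
    """Return a copy of row with each computed column evaluated and added."""
    result = dict(row)
    for name, expression in computed.items():
        # Each expression is a function of the original row.
        result[name] = expression(row)
    return result


# price * quantity AS extendedPrice
computed = {"extendedPrice": lambda row: row["price"] * row["quantity"]}
print(add_computed_columns({"price": 2.5, "quantity": 4}, computed))
```

Because each expression is an arbitrary function, the same shape would cover a UDF call as easily as simple arithmetic.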


> Enable pushing of cast expressions to the scanner for better schema discovery.
> --
>
> Key: DRILL-6312
> URL: https://issues.apache.org/jira/browse/DRILL-6312
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.13.0
>Reporter: Hanumath Rao Maduri
>Priority: Major
>
> Drill is a schema-less engine which tries to infer the schema from disparate 
> sources at read time. Currently the scanners infer the schema for each 
> batch depending upon the data for that column in the corresponding batch. 
> This solves many use cases but can error out when the data is too different 
> between batches, such as int and array[int] (to give just one example; there 
> are other cases as well).
> There is also a mechanism to create a view by type casting the columns to 
> the appropriate types. This solves issues in some cases but fails in many 
> others, because the cast expression is not pushed down to the scanner but 
> stays at the project, filter, or other operators higher up the query plan.
> This JIRA is to fix this by propagating the type information embedded in the 
> cast function to the scanners so that scanners can cast the incoming data 
> appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429634#comment-16429634
 ] 

Paul Rogers edited comment on DRILL-6312 at 4/8/18 5:50 AM:


While type inference (using Cast and other hints) is a very good idea, it 
cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. 
If a column is not projected, no hint can be provided, and we can end up with 
possible read-time problems as discussed in the original e-mail thread ("Death 
of Schema on Read").
* The only way to express the type of a column is to explicitly include it in 
the SELECT clause. Using a wildcard ("*") query will bypass the type rules 
unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the 
others to be detected automatically. If we must use a view, and we have to, 
say, use a cast for column x, then we have to include all other columns in the 
SELECT clause or we end up projecting only x. We can't use a wildcard for the 
other columns.
* Putting the type information in the query puts the burden on the query writer 
(and, ultimately, something like Tableau). But the schema is a property of the 
data, not the query, so this is not a good model of reality.

For this reason, the cast idea, though elegant and a very good enhancement, 
cannot be the full answer. It will reduce the number of cases where type 
ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply 
type information. For example, in a view or query, provide explicit hint syntax:

{noformat}
SELECT * FROM myFunkyTable
  WITH HINTS (f: INT, m.x: BIGINT NOT NULL,  a[]: VARCHAR NULL)
{noformat}

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type 
specified. If the fields don't appear, just ignore the hints.
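
The "apply if present, otherwise ignore" rule can be sketched as follows. This is only an illustration in Python: the {{apply_hints}} helper, the flat record with dotted names, and the use of Python {{int}} for both INT and BIGINT are all assumptions, and nullability and array hints are left out:

```python
def apply_hints(record: dict, hints: dict) -> dict:
    """Convert hinted fields that are present; leave everything else alone."""
    typed = {}
    for name, value in record.items():
        convert = hints.get(name)
        if convert is None or value is None:
            typed[name] = value           # no hint, or a null: keep as read
        else:
            typed[name] = convert(value)  # hinted field: apply the type
    return typed


# Rough analogue of WITH HINTS (f: INT, m.x: BIGINT); both map to Python int.
hints = {"f": int, "m.x": int}
print(apply_hints({"f": "10", "other": 1.5}, hints))
```

The hint for "m.x" is simply unused here because that field does not appear in the record.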

Most systems put this information in metadata, but Drill is very hostile to 
metadata, so it must be in the query (or, equivalently, a view.)

Lore has it that the early Drill designers proposed a ".drill" file to hold 
schema information. In this case, schema information would be an add-on file, 
much as views are. As proposed in the e-mail thread, perhaps both forms of 
information can be combined in a single file.



[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429634#comment-16429634
 ] 

Paul Rogers commented on DRILL-6312:


While type inference (using Cast and other hints) is a very good idea, it 
cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. 
If a column is not projected, no hint can be provided, and we can end up with 
possible read-time problems as discussed in the original e-mail thread ("Death 
of Schema on Read").
* The only way to express the type of a column is to explicitly include it in 
the SELECT clause. Using a wildcard ("*") query will bypass the type rules 
unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the 
others to be detected automatically. If we must use a view, and we have to, 
say, use a cast for column x, then we have to include all other columns in the 
SELECT clause or we end up projecting only x.

For this reason, the cast idea, though elegant and a very good enhancement, 
cannot be the full answer. It will reduce the number of cases where type 
ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply 
type information. For example, in a view or query, provide explicit hint syntax:

{noformat}
SELECT * FROM myFunkyTable
  WITH HINTS (f: INT, m.x: BIGINT NOT NULL,  a[]: VARCHAR NULL)
{noformat}

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type 
specified. If the fields don't appear, just ignore the hints.

Most systems put this information in metadata, but Drill is very hostile to 
metadata, so it must be in the query (or, equivalently, a view.)



[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429632#comment-16429632
 ] 

Paul Rogers commented on DRILL-6312:


The idea of using the cast statement came from [~tdunning], and is a very good 
one.

The idea can be generalized using ideas from [this 
paper|https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/]. 
Cast is just a special case of a more general idea: top-down, then bottom-up 
typing. Drill already implements bottom-up typing: Drill starts with columns, 
then infers the overloaded versions of functions based on arguments, and 
eventually arrives at the type of each column in the result set.

For example, if we have an expression {{a + b}}, the reader will figure out the 
types of {{a}} and {{b}}.  Perhaps {{a}} is an {{INT}} and {{b}} is a 
{{Float8}}. Through type inference, Drill will find a version of the {{add}} 
function that takes two {{Float8}} arguments. Next, Drill will infer that it 
can convert an {{INT}} to a {{Float8}}.
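
A toy model of that bottom-up step, not Drill's actual function resolver: the overload catalog, the widening order, and the name {{resolve_add}} are assumptions made for this sketch.

```python
# Toy catalog: one add() overload per numeric type, narrowest first.
ADD_OVERLOADS = [("INT", "INT"), ("BIGINT", "BIGINT"), ("FLOAT8", "FLOAT8")]
# Assumed implicit-cast order: a type may widen to any type to its right.
WIDENING = ["INT", "BIGINT", "FLOAT8"]


def can_widen(source: str, target: str) -> bool:
    return WIDENING.index(source) <= WIDENING.index(target)


def resolve_add(left: str, right: str):
    """Pick the narrowest add() overload both argument types can widen to."""
    for overload in ADD_OVERLOADS:
        if can_widen(left, overload[0]) and can_widen(right, overload[1]):
            # Record which argument positions need an implicit cast.
            pairs = zip((left, right), overload)
            casts = [i for i, (have, want) in enumerate(pairs) if have != want]
            return overload, casts
    raise TypeError(f"no add() overload for ({left}, {right})")


print(resolve_add("INT", "FLOAT8"))
```

For {{INT}} and {{Float8}} inputs the sketch selects the two-{{Float8}} overload and marks the first argument as needing a cast.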

The idea here is to run the system in reverse, from the result set back out to 
the scan columns. For each expression (function) in the SELECT clause, infer 
the types of the inputs. If we have the expression above, {{a + b}}, then we 
can scan all the available versions of the {{add}} function to determine the 
set of possible argument types. Since {{add}} has many versions, one for each 
numeric type, we'll need a way to say that the arguments must be numeric, 
though we don't care about the specific type. So, label the inputs with a new 
abstract type, {{Numeric}}.

We've now labeled the arguments {{a}} and {{b}} as {{Numeric}}. We pass that 
information into the Scan operator, say the JSON reader. Now, when JSON sees 
the first value of {{a}} as null, and finds that {{b}} is missing, JSON has 
context to choose the correct type; say {{Float8}} or {{BigInt}} (the two 
numeric types that JSON uses.)
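
Here is one way the top-down step and the reader's choice could look, as a sketch under assumptions: the overload catalog is a toy, the reader's preference order and the fallback guess are invented, and none of the names come from Drill.

```python
# Toy catalog of add() overloads (argument types only).
ADD_OVERLOADS = [("INT", "INT"), ("BIGINT", "BIGINT"), ("FLOAT8", "FLOAT8")]


def allowed_argument_types(overloads, position: int) -> set:
    """Top-down step: every type some overload accepts at this position."""
    return {signature[position] for signature in overloads}


def choose_null_type(allowed, reader_types, fallback: str) -> str:
    """Reader step: type a null or missing column using the constraint."""
    if allowed is not None:
        for candidate in reader_types:   # reader's own preference order
            if candidate in allowed:
                return candidate
    return fallback                      # no usable constraint: guess


numeric = allowed_argument_types(ADD_OVERLOADS, 0)   # plays the role of Numeric
json_types = ["BIGINT", "FLOAT8", "VARCHAR"]         # assumed JSON reader types
print(choose_null_type(numeric, json_types, fallback="INT"))
```

Without the constraint (passing {{None}}), the reader in this sketch can only return its fallback guess, which is the situation the type information is meant to avoid.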

As we can see, Cast is just a special case: one in which the type is narrowed 
down to one very specific type. That is, {{CAST(a AS INT)}} says not just that 
{{a}} is numeric, but that it is {{INT}}.

While this is all very useful, it still leads to ambiguity. In the case above, 
if all we know is that {{a}} is numeric, the first reader, the one that sees a 
{{null}} value, can choose {{BigInt}}. But, if another reader (or a later 
record) actually has the value as {{Float8}}, we've still got problems.

The result is a "bounce" algorithm: do a top-down tree traversal of the parse 
tree to infer possible expression types. Then, at runtime, continue to use the 
bottom-up traversal to infer actual types.
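
The two passes might be sketched like this. It is a toy, not a proposal for the planner: expressions are plain tuples, only {{add}} and {{cast}} are handled, and the function names and type preference order are assumptions.

```python
NUMERIC = frozenset({"INT", "BIGINT", "FLOAT8"})


def top_down(expr, constraints=None) -> dict:
    """Plan time: walk an expression and collect allowed types per column."""
    if constraints is None:
        constraints = {}
    if expr[0] == "cast":                 # ("cast", column, type)
        constraints[expr[1]] = {expr[2]}  # narrowed to exactly one type
    elif expr[0] == "add":                # ("add", operand, operand)
        for operand in expr[1:]:
            if isinstance(operand, tuple):
                top_down(operand, constraints)
            else:
                constraints.setdefault(operand, set(NUMERIC))
    return constraints


def bottom_up(value, allowed) -> str:
    """Run time: use the data when present; use the constraint for nulls."""
    if value is None:
        for candidate in ("BIGINT", "FLOAT8", "VARCHAR"):
            if candidate in allowed:
                return candidate
        return sorted(allowed)[0]
    if isinstance(value, float):
        return "FLOAT8"
    return "BIGINT" if isinstance(value, int) else "VARCHAR"


constraints = top_down(("add", "a", ("cast", "b", "INT")))
print(bottom_up(None, constraints["a"]), bottom_up(1.5, constraints["a"]))
```

The final line shows the leftover ambiguity: a null in {{a}} is typed one way, a later float value another, even though both respect the numeric constraint.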



[jira] [Created] (DRILL-6313) ScanBatch.Mutator does not report new schema for empty first batch

2018-04-07 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6313:
--

 Summary: ScanBatch.Mutator does not report new schema for empty 
first batch
 Key: DRILL-6313
 URL: https://issues.apache.org/jira/browse/DRILL-6313
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.13.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.14.0


Create a format plugin that honors an empty select list by returning no 
columns. This case occurs in a {{COUNT(\*)}} query.

When run, the query fails with:

{noformat}
SYSTEM ERROR: IllegalStateException: next() returned OK without first returning 
OK_NEW_SCHEMA [#2, ScanBatch]
{noformat}

The reason is that the {{Mutator}} class uses a flag, {{schemaChanged}}, which 
defaults to {{false}}. It is set to {{true}} only when a field is 
added. But, since the query requested no fields, no field is added.

The fix is simple: default {{schemaChanged}} to {{true}}.
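
The effect of the default can be seen in a toy model. This is a Python stand-in, not the Drill class: the method names and the {{first_outcome}} helper are hypothetical and exist only to show the flag's behavior.

```python
class Mutator:
    """Toy stand-in for ScanBatch.Mutator; not the Drill implementation."""

    def __init__(self, schema_changed_default: bool):
        self.schema_changed = schema_changed_default
        self.fields = []

    def add_field(self, name: str) -> None:
        self.fields.append(name)
        self.schema_changed = True   # the only place the flag is raised

    def is_new_schema(self) -> bool:
        changed = self.schema_changed
        self.schema_changed = False  # reset once reported
        return changed


def first_outcome(mutator: Mutator, projected: list) -> str:
    """Outcome reported for the first batch, given the projected columns."""
    for name in projected:
        mutator.add_field(name)
    return "OK_NEW_SCHEMA" if mutator.is_new_schema() else "OK"


print(first_outcome(Mutator(False), []))  # old default, empty select list
print(first_outcome(Mutator(True), []))   # proposed default
```

With the old default and no projected columns, the first batch reports {{OK}}, which is what triggers the {{IllegalStateException}} above.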





[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Hanumath Rao Maduri (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429403#comment-16429403
 ] 

Hanumath Rao Maduri commented on DRILL-6312:


Please find the mail thread which discusses various issues and approaches to 
deal with discovery of schema.

{noformat}
Hi Hanu,

The problem with views as is, even with casts, is that the casting comes too 
late to resolve the issues I highlighted in earlier messages. Ted's cast 
push-down idea causes the conversion to happen at read time so that we can, 
say, cast a string to an int, or cast a null to the proper type.

Today, if we use a cast, such as SELECT cast(a AS INT) FROM myTable then we get 
a DAG that has three parts (to keep things simple):

* Scan the data, using types inferred from the data itself
* In a Filter operator, convert the type of data to INT
* In Screen, return the result to the user

If the type is ambiguous in the file, then the first step above fails; data 
never gets far enough for the Filter to kick in and apply the cast. Also, if a 
file contains a run of nulls, the scanner will choose Nullable Int, then fail 
when it finds, say, a string.

The key point is that the cast push-down means that the query will not fail due 
to dicey files: the cast resolves the ambiguity. If we push the cast down, then 
it is the SCAN operator that resolves the conflict and does the cast; avoiding 
the failures we've been discussing.

I like the idea you seem to be proposing: cascading views. Have a table view 
that cleans up each table. Then, these can be combined in higher-order views 
for specialized purposes.

The beauty of the cast push-down idea is that no metadata is needed other than 
the query. If the user wants metadata, they use existing views (that contain 
the casts and cause the cast push-down.)

This seems like such a simple, elegant solution that we could try it out 
quickly (if we get past the planner issues Aman mentioned.) In fact, the new 
scan operator code (done as part of the batch sizing work) already has a 
prototype mechanism for type hints. If type hints are provided to the 
scanner, it uses them; otherwise it infers the type. We'd just hook up the cast 
push-down data to that prototype and we could try out the result quickly. (The 
new scan operator is still in my private branch, in case anyone goes looking 
for it...)

Some of your discussion talks about automatically inferring the schema. I 
really don't think we need to do that. The hint (cast push-down) is sufficient 
to resolve ambiguities in the existing scan-time schema inference.

The syntax trick would be to find a way to provide hints just for those columns 
that are issues. If I have a table with columns a, b, ... z, but only b is a 
problem, I don't want to have to do:

SELECT a, CAST(b AS INT), c, ... z FROM myTable

Would be great if we could just do:

SELECT *, CAST(b AS INT) FROM myTable

I realize the above has issues; the key idea is: provide casts only for the 
problem fields without spelling out all fields.

If we really want to get fancy, we can do UDF push down for the complex cases 
you mentioned. Maybe:

SELECT *, CAST(b AS INT), parseCode(c) ...

We are diving into design here; maybe you can file a JIRA and we can shift 
detailed design discussion to that JIRA. Salim already has one related to 
schema change errors, which was why the "Death" article caught my eye.

Thanks,
- Paul





On Friday, April 6, 2018, 4:59:40 PM PDT, Hanumath Rao Maduri 
 wrote:

 Hello,

Thanks to Ted & Paul for clarifying my questions.
Sorry for not being clear in my previous post. When I said create view, I
had in mind simple views where we currently use cast expressions
to cast columns to types. In this case the planner can use this
information to force the scans to use this as the schema.

If the query fails then it fails at the scan and not after inferring the
schema by the scanner.

I know that views can get complicated with joins and expressions. For
schema hinting through views I assume they should be created on single
tables with corresponding columns one wants to project from the table.


Regarding the same question, today we had a discussion with Aman. Here view
can be considered as a "view" of the table with schema in place.

We can change some syntax to suit it for specifying schema, something like
this:

create schema[optional] view(/virtual table ) v1 as (a: int, b : int)
select a, b from t1 with some other rules as to conversion of scalar to
complex types.

Then the queries when used on this view (below) should enable the scanner
to use this type information and then use it to convert the data into the
appropriate types.
select * from v1

In case the schema information is not known by the user, maybe
use something like this.

create schema[optional] view(/virtual table) v1 as select a, b from t1
infer 

[jira] [Created] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Hanumath Rao Maduri (JIRA)
Hanumath Rao Maduri created DRILL-6312:
--

 Summary: Enable pushing of cast expressions to the scanner for 
better schema discovery.
 Key: DRILL-6312
 URL: https://issues.apache.org/jira/browse/DRILL-6312
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators, Query Planning & 
Optimization
Affects Versions: 1.13.0
Reporter: Hanumath Rao Maduri


Drill is a schema-less engine which tries to infer the schema from disparate 
sources at read time. Currently the scanners infer the schema for each 
batch depending upon the data for that column in the corresponding batch. This 
solves many use cases but can error out when the data is too different between 
batches, such as int and array[int] (to give just one example; there are other 
cases as well).

There is also a mechanism to create a view by type casting the columns to 
the appropriate types. This solves issues in some cases but fails in many 
others, because the cast expression is not pushed down to the scanner but 
stays at the project, filter, or other operators higher up the query plan.

This JIRA is to fix this by propagating the type information embedded in the 
cast function to the scanners so that scanners can cast the incoming data 
appropriately.
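
The difference between inferring per batch and casting in the scanner can be sketched as below. This is a simplified model in Python, not Drill code: {{scan}} is a hypothetical function, a batch is a plain list holding one column's values, and the conflict shown is string versus int rather than int versus array[int].

```python
def scan(batches, cast_to=None) -> list:
    """Read batches of one column, failing if the inferred type changes."""
    schema = None
    rows = []
    for batch in batches:
        if cast_to is not None:
            # Pushed-down cast: convert at read time, before inference.
            batch = [None if v is None else cast_to(v) for v in batch]
        kinds = {type(v).__name__ for v in batch if v is not None}
        if len(kinds) > 1:
            raise ValueError(f"mixed types within a batch: {sorted(kinds)}")
        if kinds:
            kind = kinds.pop()
            if schema is not None and kind != schema:
                raise ValueError(f"schema change: {schema} -> {kind}")
            schema = kind
        rows.extend(batch)
    return rows


batches = [["1", "2"], [3, None]]
print(scan(batches, cast_to=int))
```

Calling {{scan(batches)}} without the cast raises on the second batch, because the type inferred from the first batch no longer matches.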







[jira] [Commented] (DRILL-6289) Cluster view should show more relevant information

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429347#comment-16429347
 ] 

ASF GitHub Bot commented on DRILL-6289:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1203
  
Before the review I guess we need to clarify one thing. After DRILL-6044, the 
Shutdown button was shown only for the current drillbit. As far as I 
understand, you cannot shut down drillbits other than the current one from 
the Web UI. @dvjyothsna please confirm.


> Cluster view should show more relevant information
> --
>
> Key: DRILL-6289
> URL: https://issues.apache.org/jira/browse/DRILL-6289
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Affects Versions: 1.13.0
>Reporter: Kunal Khatua
>Assignee: Kunal Khatua
>Priority: Major
> Fix For: 1.14.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When fixing DRILL-6224, I noticed that the same information can be very 
> useful to have in the cluster view shown on a Drillbit's homepage. 
> The proposal is to show the following:
> # Heap Memory in use
> # Direct Memory (actively) in use - Since we're not able to get the total 
> memory held by Netty at the moment, but only what is currently allocated to 
> running queries
> # Process CPU
> # Average (System) Load Factor 
> Information such as the port numbers don't help much during general cluster 
> health, so it might be worth removing this information if more real-estate is 
> needed.





[jira] [Resolved] (DRILL-6296) Add operator metrics for batch sizing for merge join

2018-04-07 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-6296.
-
Resolution: Fixed

Merged with commit id da241134fb88464139437b05b1feaafbb3014bb0.

> Add operator metrics for batch sizing for merge join
> 
>
> Key: DRILL-6296
> URL: https://issues.apache.org/jira/browse/DRILL-6296
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.13.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>Priority: Major
> Fix For: 1.14.0
>
>
> Add operator metrics for batch sizing stats for merge join.





[jira] [Updated] (DRILL-6287) apache-release profile should be disabled by default

2018-04-07 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6287:

Fix Version/s: 1.14.0

> apache-release profile should be disabled by default
> 
>
> Key: DRILL-6287
> URL: https://issues.apache.org/jira/browse/DRILL-6287
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>






[jira] [Commented] (DRILL-6230) Extend row set readers to handle hyper vectors

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429337#comment-16429337
 ] 

ASF GitHub Bot commented on DRILL-6230:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1161


> Extend row set readers to handle hyper vectors
> --
>
> Key: DRILL-6230
> URL: https://issues.apache.org/jira/browse/DRILL-6230
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> The current row set readers have incomplete support for hyper-vectors. To add 
> full support, we need an interface that supports either single batches or 
> hyper batches. Accessing vectors in hyper batches differs depending on 
> whether the vector is at the top level or is nested. See [this 
> post|https://github.com/paul-rogers/drill/wiki/BH-Column-Readers] for 
> details. Also includes a simpler reader template: replaces the original three 
> classes with one, in parallel with the writers.





[jira] [Commented] (DRILL-6303) Provide a button to copy the Drillbit's JStack shown in /threads

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429332#comment-16429332
 ] 

ASF GitHub Bot commented on DRILL-6303:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1199


> Provide a button to copy the Drillbit's JStack shown in /threads
> 
>
> Key: DRILL-6303
> URL: https://issues.apache.org/jira/browse/DRILL-6303
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Reporter: Kunal Khatua
>Assignee: Kunal Khatua
>Priority: Trivial
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: mouseOnClick.png, mouseOver.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Currently, when using the WebUI to inspect the JStack for the state of 
> threads within a Drillbit (via +{{http://:8047/threads}}+ ), the 
> contents of the `div` element refresh automatically and reset any 
> selection, making it harder to freeze the contents for inspection.
> Pausing the refresh is not recommended, so the alternative is to copy the 
> contents to the user's clipboard for viewing separately in a text editor.





[jira] [Commented] (DRILL-6016) Error reading INT96 created by Apache Spark

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429335#comment-16429335
 ] 

ASF GitHub Bot commented on DRILL-6016:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1166


> Error reading INT96 created by Apache Spark
> ---
>
> Key: DRILL-6016
> URL: https://issues.apache.org/jira/browse/DRILL-6016
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Rahul Raj
>Assignee: Rahul Raj
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Hi,
> I am getting the error - SYSTEM ERROR : ClassCastException: 
> org.apache.drill.exec.vector.TimeStampVector cannot be cast to 
> org.apache.drill.exec.vector.VariableWidthVector while trying to read a Spark 
> INT96 datetime field on Drill 1.11, in spite of setting the property 
> store.parquet.reader.int96_as_timestamp to true.
> I believe this was fixed in Drill 
> 1.10 (https://issues.apache.org/jira/browse/DRILL-4373). What could be wrong?
> I have attached the dataset at 
> https://github.com/rajrahul/files/blob/master/result.tar.gz





[jira] [Commented] (DRILL-6279) Web UI should indicate when operators have spilled in-memory data to disk

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429331#comment-16429331
 ] 

ASF GitHub Bot commented on DRILL-6279:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1197


> Web UI should indicate when operators have spilled in-memory data to disk
> -
>
> Key: DRILL-6279
> URL: https://issues.apache.org/jira/browse/DRILL-6279
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Kunal Khatua
>Assignee: Kunal Khatua
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: spillToDiskSnapshot.png
>
>
> Currently, there is no indication of when an operator is spilling to disk, 
> which would help explain a slow-running query. 
> Suggestions are welcome, but the current proposal is to simply update the 
> Operators Overview section to show average and max spill cycles, preferably 
> with a color code (or formatting).  





[jira] [Commented] (DRILL-6287) apache-release profile should be disabled by default

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429333#comment-16429333
 ] 

ASF GitHub Bot commented on DRILL-6287:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1182


> apache-release profile should be disabled by default
> 
>
> Key: DRILL-6287
> URL: https://issues.apache.org/jira/browse/DRILL-6287
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>Priority: Minor
>  Labels: ready-to-commit
>






[jira] [Commented] (DRILL-6271) Update copyright range in NOTICE

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429334#comment-16429334
 ] 

ASF GitHub Bot commented on DRILL-6271:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1188


> Update copyright range in NOTICE
> 
>
> Key: DRILL-6271
> URL: https://issues.apache.org/jira/browse/DRILL-6271
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vlad Rozov
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>






[jira] [Commented] (DRILL-6284) Add operator metrics for batch sizing for flatten

2018-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429336#comment-16429336
 ] 

ASF GitHub Bot commented on DRILL-6284:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1181


> Add operator metrics for batch sizing for flatten
> -
>
> Key: DRILL-6284
> URL: https://issues.apache.org/jira/browse/DRILL-6284
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Flow
>Affects Versions: 1.13.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>Priority: Critical
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Add the following operator metrics for flatten:
> INPUT_BATCH_COUNT,
> AVG_INPUT_BATCH_BYTES,
> AVG_INPUT_ROW_BYTES,
> INPUT_RECORD_COUNT,
> OUTPUT_BATCH_COUNT,
> AVG_OUTPUT_BATCH_BYTES,
> AVG_OUTPUT_ROW_BYTES,
> OUTPUT_RECORD_COUNT
>  


