[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-08-14 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125499#comment-16125499
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


[~mandoskippy], the goal of this Jira is to fix the issue you mentioned.
With the fix for this Jira, the query 
{code:sql}
select * from `test.json`;
{code}
where *test.json* is the file from the Jira description (it also has dots in the 
field names), returns the correct result:
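{noformat}
+--------------------------------------------------+--------------------------------------------------+
| 0.0.1                                            | 0.1.2                                            |
+--------------------------------------------------+--------------------------------------------------+
| {"version":"0.0.1","date_created":"2014-03-15"}  | {"version":"0.1.2","date_created":"2014-05-21"}  |
+--------------------------------------------------+--------------------------------------------------+
{noformat}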

> Dots in identifier are not escaped correctly
> 
>
> Key: DRILL-4264
> URL: https://issues.apache.org/jira/browse/DRILL-4264
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Codegen
>Reporter: Alex
>Assignee: Volodymyr Vysotskyi
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> If you have some json data like this...
> {code:javascript}
> {
>   "0.0.1":{
> "version":"0.0.1",
> "date_created":"2014-03-15"
>   },
>   "0.1.2":{
> "version":"0.1.2",
> "date_created":"2014-05-21"
>   }
> }
> {code}
> ... there is no way to select any of the rows since their identifiers contain 
> dots and when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference 
> "0.0.1"; a field reference identifier must not have the form of a qualified 
> name
> This must be fixed since there are many json data files containing dots in 
> some of the keys (e.g. when specifying version numbers etc)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-08-11 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123836#comment-16123836
 ] 

John Omernik commented on DRILL-4264:
-

Lots to process here...

Let me add a simple thing though...

"As a user, I have a new data set and I have no idea what's in it. From Drill 
(that's the key) I should be able to select * from the directory, and if it's a 
known format (JSON, Parquet, CSV, etc.) I should get results back. I know I can 
do select `field.one`, `field.two` from the directory and get it, but say it's a 
Parquet file created in Spark... there is no way for me to explore that data in 
Drill... I need select *."



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-26 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102548#comment-16102548
 ] 

Paul Rogers commented on DRILL-4264:


Wonderful detailed analysis! You caught many detailed issues that my quick scan 
missed.

The solution for Parquet metadata seems good. I'm not an expert in that area, 
but a few unit tests will validate the change once you make it. Bumping the 
version number will solve the forward/backward compatibility issues (using the 
mechanism from DRILL-5660.)

The {{MaterializedField}} issue is harder. Fortunately, some of the nested-name 
issues might not be actual issues.

For example, your example of 
[ScanBatch.Mutator:362|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
 should be OK as long as the caller knows to pass in only the leaf name. This 
line is used to build up a record batch during reading such as in JSON or 
Parquet. The problem is if the container is a map. In this case, the caller 
should be calling {{AbstractMapVector.addOrGet()}} to add the field rather than 
adding it at the top level using the {{Mutator}}.

Are there other cases where the code assembles a path then tears it down again? 
Or, parses a path?

Otherwise, we can find all uses of {{MaterializedField.getPath()}}, verify that 
they really only use the leaf name, and replace them with {{getName()}}. The 
same is true of {{getLastName()}}.



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-25 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099918#comment-16099918
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


Thanks for such a detailed analysis. 

I agree that deserializing {{ColumnTypeMetadata_v3.Key}} objects this way 
will cause problems for fields that contain dots in their names. To 
solve this issue, I propose changing the structure of the 
{{ColumnTypeMetadata_v3.Key}} class. Instead of using an array with the 
components of the field name, we should use {{SchemaPath}} and serialise it as a 
string obtained by calling {{SchemaPath.toExpr()}}. With this change, we should 
also update the Parquet metadata version. 
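
To make the idea of a quoted string key concrete, here is a minimal sketch of a round trip through a backtick-quoted form. It does not use Drill's {{SchemaPath}}; the class {{QuotedPathCodec}} and both of its methods are hypothetical, and the sketch assumes that field names never contain a backtick.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; Drill's SchemaPath is not used here.
final class QuotedPathCodec {

  // Serializes name parts as `a`.`d.e`; assumes names contain no backticks.
  static String toQuotedString(List<String> parts) {
    StringBuilder sb = new StringBuilder();
    for (String part : parts) {
      if (sb.length() > 0) {
        sb.append('.');
      }
      sb.append('`').append(part).append('`');
    }
    return sb.toString();
  }

  // Parses the quoted form back; dots inside backticks stay in the name.
  static List<String> fromQuotedString(String text) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (char c : text.toCharArray()) {
      if (c == '`') {
        inQuotes = !inQuotes;
        if (!inQuotes) { // a closing backtick ends one name part
          parts.add(current.toString());
          current.setLength(0);
        }
      } else if (inQuotes) {
        current.append(c);
      }
      // Characters outside backticks (the separator dots) are skipped.
    }
    return parts;
  }

  public static void main(String[] args) {
    List<String> key = Arrays.asList("a", "d.e");
    String serialized = toQuotedString(key);
    System.out.println(serialized);
    System.out.println(fromQuotedString(serialized).equals(key));
  }
}
```

Unlike a plain {{split("\\.")}}, the parser only treats a dot as a separator when it is outside backticks, so the two-part key survives serialization.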

A more complex problem is connected with {{MaterializedField}} class. 
{{SchemaPath}} was removed from {{MaterializedField}} class in 
[PR-373|https://github.com/apache/drill/pull/373]. One of the reasons for this 
refactoring was the assumption that {{MaterializedField}} should have no 
knowledge of its parents. Some code in Drill supposes that 
{{MaterializedField.getPath()}} returns field path including its parents. 
For example in [this 
line|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
 {{MaterializedField}} instance will be created with the name 
{{col.getAsUnescapedPath()}}. In [this 
line|https://github.com/apache/drill/blob/874bf6296dcd1a42c7cf7f097c1a6b5458010cbb/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L362]
 the name with parent field names was used. Using only the field name in 
{{MaterializedField}} would cause problems, since a field at the root level may 
have the same name as a field nested in a map. 
So the full field path should be used in the {{MaterializedField}} class in this 
case.

The {{SchemaPath.getSimplePath(field.getPath())}} code is used in many places, 
but it does not return the same {{SchemaPath}} that was used to create 
{{MaterializedField}} instance. 
We should change the implementation of {{MaterializedField}} in such a way that 
this code returns the same {{SchemaPath}} which was used to create 
{{MaterializedField}} instance. 

I think we should store a separate field {{String path}} in 
{{MaterializedField}} class with value {{SchemaPath.toExpr()}} and replace all 
{{SchemaPath.getAsUnescapedPath()}} calls by the {{SchemaPath.toExpr()}}. 
* when the {{MaterializedField}} instance is created using the path 
{{SchemaPath.toExpr()}}, the name will be assigned as the last name of the 
{{SchemaPath}}. 
* when {{MaterializedField}} instance is created using the name, the path will 
be the same as the name with backticks. 

The less preferred solution is to revert 
[PR-373|https://github.com/apache/drill/pull/373]. In that case, dots in 
field names would be handled correctly. But this solution would make the 
transition to Apache Arrow more complex (though {{MaterializedField}} was 
replaced by {{Flatbuffer Field}}, so the transition is already too complex). 




[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-19 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094260#comment-16094260
 ] 

Paul Rogers commented on DRILL-4264:


On the planner side, we represent field references with the {{FieldReference}} 
class you mentioned. {{FieldReference}} extends {{SchemaPath}}. These classes 
break names down into one object per name part.

Assume we have {{SELECT a.b.c, a."d.e" ...}}

Within the {{FieldReference}} itself, we hold onto the name using a 
{{PathSegment}} which has two subclasses: {{ArraySegment}} and {{NameSegment}}. 
So, as you noted, in the planner, we can tell the difference between the two 
cases (using functional notation):

{code}
a.b.c: FieldReference(NameSegment("a", NameSegment("b", NameSegment("c"))))
a."d.e": FieldReference(NameSegment("a", NameSegment("d.e")))
{code}

So far so good. But, {{SchemaPath}} provides the {{getAsUnescapedPath()}} 
method which concatenates the parts of the name using dots. We end up with two 
{{FieldReference}} instances. Calling {{getAsUnescapedPath()}} on each produces 
{{a.b.c}} and {{a.d.e}}. So, if anything actually uses this unescaped path, we 
end up with an ambiguity: does "a.d.e" represent one field, two fields or three 
fields? We cannot tell.
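
The ambiguity can be reproduced outside Drill with a few lines of Java. This is only an illustration: {{UnescapedPathDemo}}, {{joinWithDots()}} and {{splitOnDots()}} are hypothetical helpers that mimic the concatenation described above, not Drill code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; these helpers are not part of Drill.
final class UnescapedPathDemo {

  // Concatenates name parts with dots, as the unescaped form described above does.
  static String joinWithDots(List<String> parts) {
    return String.join(".", parts);
  }

  // Naive inverse: splits on every dot, including dots that belonged to a name.
  static List<String> splitOnDots(String path) {
    return Arrays.asList(path.split("\\."));
  }

  public static void main(String[] args) {
    List<String> threeParts = Arrays.asList("a", "d", "e"); // a.d.e
    List<String> twoParts = Arrays.asList("a", "d.e");      // a.`d.e`

    // Both paths collapse to the same string.
    System.out.println(joinWithDots(threeParts).equals(joinWithDots(twoParts)));

    // Splitting the string again yields three parts, not the original two.
    System.out.println(splitOnDots(joinWithDots(twoParts)));
  }
}
```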

Now, if this method was only used for debugging (like {{toString()}}), it would 
be fine. But, in fact, many operators refer to this method, especially when 
creating the run-time representation of a field schema: {{MaterializedField}}:

From {{StreamingAggBatch}}:

{code}
  final MaterializedField outputField = MaterializedField.create(
      ne.getRef().getAsUnescapedPath(), expr.getMajorType());
{code}

In our examples, we end up with two materialized fields: one called "a.b.c", 
the other "a.d.e", so the ambiguity persists.

As it turns out, each {{MaterializedField}} represents one field or structured 
column. So, our map "a" is represented by a {{MaterializedField}}, "b" by 
another, "c" by yet another and "d.e" by another. So, each should correspond to 
a single name part.

But, the code doesn't work that way, it actually builds up the full unescaped 
name.

Now, I suspect that the code here is old and inconsistent. It should be that 
creating a materialized field pulls out only one name part. But, the code 
actually concatenates. My suspicion increases when I see methods like these in 
{{MaterializedField}}:

{code}
  public String getPath() { return getName(); }
  public String getLastName() { return getName(); }
  public String getName() { return name; }
{code}

That first one really worries me: it is asking for the "path", which means the 
dotted name. There are many references to this name. Does this mean the code 
expects to get a string (not a {{NameSegment}}) that holds the composite name? 
If so, we are in trouble.

Now, as it turns out, it seems that the "modern" form of {{MaterializedField}} 
is that each holds just one name part. So:

{code}
MaterializedField(name="a", children = (
  MaterializedField(name="b", children = (
    MaterializedField(name="c"))),
  MaterializedField(name="d.e")))
{code}
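
As a rough sketch of that one-name-part-per-node shape, the tree above could be modeled as follows. {{FieldNode}} is a hypothetical stand-in, not Drill's {{MaterializedField}}, and its {{addOrGet()}} only borrows the name of the map-vector method for readability.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a field schema node; not Drill's MaterializedField.
final class FieldNode {
  final String name; // exactly one name part, never a dotted path
  final Map<String, FieldNode> children = new LinkedHashMap<>();

  FieldNode(String name) {
    this.name = name;
  }

  // Returns the child with this name, creating it if absent. The name is never split.
  FieldNode addOrGet(String childName) {
    return children.computeIfAbsent(childName, FieldNode::new);
  }

  public static void main(String[] args) {
    FieldNode a = new FieldNode("a");
    a.addOrGet("b").addOrGet("c"); // a.b.c
    a.addOrGet("d.e");             // a.`d.e`

    System.out.println(a.children.keySet());         // prints [b, d.e]
    System.out.println(a.children.containsKey("d")); // prints false
  }
}
```

Because no node ever stores or parses a dotted path, "d.e" stays a single child of "a" and there is nothing to split.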

I wonder, because the code appears to be written assuming that a 
{{MaterializedField}} had a path name, does any code still rely on this fact, 
then split the name at dots to get fields?

If not, can we remove the {{getPath()}} and {{getLastName()}} methods to make 
it clear that each {{MaterializedField}} corresponds to a single 
{{NameSegment}}?

And, if we do that, should we remove all calls to 
{{SchemaPath.getAsUnescapedPath()}} to make clear that we never (except for 
display) want a dotted, combined path name?

By carefully looking at the above issues, we can be sure that no old code in 
Drill tries to concatenate "a" and "d.e" to get the name "a.d.e" which it then 
splits into "a", "d" and "e".

A quick search for ".split(" found a number of places where we split names on a 
dot, including in the Parquet Metadata file:

{code}
public Object deserializeKey(String key, com.fasterxml.jackson.databind.DeserializationContext ctxt)
    throws IOException, com.fasterxml.jackson.core.JsonProcessingException {
  return new Key(key.split("\\."));
}
{code}

Are there others? Do these need to be fixed?


[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-18 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091346#comment-16091346
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


Yes, these statements are correct. 

Drill also supports array syntax {{b\['d.e'\]}}. I used it as an example of 
using field names with dots in the query. 

No, I meant that we are using the same 
[FieldReference|https://github.com/apache/drill/blob/90f43bff7a01eaaee6c8861137759b05367dfcf3/logical/src/main/java/org/apache/drill/common/expression/FieldReference.java#L40]
 class, but in planner we are using 
[checkSimpleString()|https://github.com/apache/drill/blob/90f43bff7a01eaaee6c8861137759b05367dfcf3/logical/src/main/java/org/apache/drill/common/expression/FieldReference.java#L54]
 method and at the execution stage we are using constructor 
[FieldReference()|https://github.com/apache/drill/blob/90f43bff7a01eaaee6c8861137759b05367dfcf3/logical/src/main/java/org/apache/drill/common/expression/FieldReference.java#L64]
 that checks input string for dots. 
The code does not combine the map name and nested field names. Nested field 
names are stored as children of the [NameSegment 
rootSegment|https://github.com/apache/drill/blob/90f43bff7a01eaaee6c8861137759b05367dfcf3/logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java#L45]
 in SchemaPath (it is a superclass of FieldReference). 

So I propose to remove that check.



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-17 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091092#comment-16091092
 ] 

Paul Rogers commented on DRILL-4264:


Let's see if I understand the proposal:

* Dots not allowed in storage plugin or workspace names.
* As a result, dots in quoted workspace names are assumed to be separators: 
{{`dfs.x` = `dfs`.`x`}}
* Dots in table names must be quoted. {{`dfs.x`.`my.table.json`}}
* Dots in columns must be quoted. {{SELECT a, b.c, b.`d.e` ...}}

Is this correct?

If so, then it seems fine.

Are you also proposing to support array syntax? {{SELECT a, b.c, b[`d.e`] ...}}?

You mentioned we use one code path for names in the planner, another 
({{SchemaPath}}) for runtime. What rules do we use at runtime? Can that code 
handle column names with dots? That is, do we ever start with a map "b" and a 
column "d.e", combine them to get "b.d.e" and try to split them again to get 
"b", "d" and "e"? If so, how do we fix that?

Once we get all these questions resolved, I'd suggest posting a summary of the 
rules to the dev list so that others can take a look. 



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-17 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089604#comment-16089604
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


I think we shouldn't allow dots in the names of plugins and workspaces, since 
the Drill user controls these names and schema names are stored without quotes. 
For example, if we have a plugin named {{`df.s`}} as well as a plugin 
named {{`df`}} with a workspace named {{`s`}} in it, the schema name {{df.s}} will 
refer to two different schemas. 
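
A minimal sketch of that collision, assuming schema names are built by joining the plugin and workspace names with an unquoted dot. {{SchemaNameCollision}} and its {{register()}} method are hypothetical and do not model Drill's storage plugin registry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry; Drill's storage plugin registry is not modeled here.
final class SchemaNameCollision {

  // Maps each unquoted schema name to the descriptions of everything registered under it.
  static Map<String, List<String>> register(String[][] entries) {
    Map<String, List<String>> byName = new HashMap<>();
    for (String[] entry : entries) {
      String plugin = entry[0];
      String workspace = entry[1]; // null when the entry is a bare plugin
      String schemaName = workspace == null ? plugin : plugin + "." + workspace;
      String description = workspace == null
          ? "plugin " + plugin
          : "workspace " + workspace + " in plugin " + plugin;
      byName.computeIfAbsent(schemaName, k -> new ArrayList<>()).add(description);
    }
    return byName;
  }

  public static void main(String[] args) {
    Map<String, List<String>> byName = register(new String[][] {
        {"df.s", null}, // plugin named df.s
        {"df", "s"},    // workspace s inside plugin df
    });
    // Two different schemas end up under the single name df.s.
    System.out.println(byName.get("df.s"));
  }
}
```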

For table, directory and file names, dots are allowed. The path to the table 
in the query is specified using quotes. The same applies to table names with dots. 

Currently, dots are also allowed in column aliases, which are quoted as well. 
However, Drill can create a table using a column alias with dots and will then fail at 
the execution stage when it tries to do a {{select *}}. 

The solution that I proposed in [my 
comment|https://issues.apache.org/jira/browse/DRILL-4264?focusedCommentId=16085700&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16085700]
 does not change the current syntax of Drill queries; I think the current 
syntax is fine. The proposed solution keeps Drill from failing when it 
discovers such columns in a table.



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-14 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087705#comment-16087705
 ] 

Paul Rogers commented on DRILL-4264:


Thanks for the explanation! Can you suggest a solution (from the user's view) 
that offers the best trade-off between a number of requirements?

Requirements:

* Allow dots in any name (schema, workspace, table, column, directory, file).
* Consistent behavior for all names.
* Compatible with existing queries.
* Familiar to users of similar systems (Hive, Impala, etc.)

Maybe propose a syntax and user-visible rules that achieve the above (as best 
we can.) Once we agree on those rules, we can move on to discuss how we'll 
implement the rules given Calcite and the Drill execution classes.




[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-14 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087229#comment-16087229
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


On the one hand, the user knows that the {{tmp}} workspace is inside the {{dfs}} plugin, so 
the user expects that the query 
{code:sql}
use dfs.tmp;
{code}
should work. On the other hand query 
{code:sql}
show schemas;
{code}
returns the schema name {{dfs.tmp}}. So user also expects that the query with 
such schema name should work.
{code:sql}
use `dfs.tmp`;
{code}
Both these cases work since schema name and its path are the same.

Schema names with dots currently do not work in Drill. This is due to 
schema paths being handled in [this 
way|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/rpc/user/UserSession.java#L201]

Queries
{code:sql}
SELECT `dfs`.`ds`.`foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (1)
SELECT `dfs.ds.foo.json.a`.`b.c` FROM `dfs.ds`.`foo.json` (2)
SELECT `dfs.ds.foo.json.a.b.c` FROM `dfs.ds.foo.json` (3)
{code}
will not work since Drill allows only table names or aliases before the field 
names. 
Considering only the FROM clause, the third case would not work, since Drill assumes 
that {{`dfs.ds.foo.json`}} is only a schema name.
Queries on directories with dots and table names with dots also work 
correctly. 

Hive has an 
[option|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.support.quoted.identifiers]
 that allows dots in column names (by default, dots in column names are 
allowed). 
Parquet also allows field names with dots. 
Also, the current version of Drill can create Parquet files with dots in field 
names, but Drill will fail when querying such a file.



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-13 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086580#comment-16086580
 ] 

Paul Rogers commented on DRILL-4264:


Putting on my user hat, I don't think users of Drill (or even non-planner 
developers like me) will understand that column names behave differently than 
table/schema names.

For example, from the description, I cannot predict what happens here:

{code}
Data: dfs.ds.foo.json, contents: { "a" : { "b.c": 10 } }

SELECT `dfs`.`ds`.`foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (1)
SELECT `dfs.ds.foo.json.a`.`b.c` FROM `dfs.ds`.`foo.json` (2)
SELECT `dfs.ds.foo.json.a.b.c` FROM `dfs.ds.foo.json` (3)
{code}

Case (1) should work, right? Each component of the path name is enclosed in 
back-ticks, and the separator dots are unquoted. {{`b.c`}} is quoted, so it is a complete field 
name. Similarly the file name, {{`foo.json`}}, is quoted, so clearly {{.json}} is 
part of the file name.

By this logic, case (2) should not work. The quotes enclose parts of a path and 
so the dots in those names should not be treated as delimiters, but rather as 
part of the name. That is, the SELECT list and FROM clause should be read as:

Table: {{`dfs.ds.foo.json.a`}}
Column: {{`b.c`}}
Schema: {{`dfs.ds`}}
Table: {{`foo.json`}}

Since the two tables do not agree, the query should fail in the planner. Even 
if it didn't, it should return null because no table column matches {{`b.c`}}. 
And yet your explanation suggests that some part of the quoted name will be 
considered separate components.

If so, then Drill is magic, it knows when dots are part of the name (file name) 
and so (3) should work also. But, it won't for the reasons you state.

Ideally, we'd enforce form (2): dots inside quotes are part of the name; they 
are not separators. But, it seems if we do that we might break existing 
queries. Or, can this actually work?

Let me throw in two more complications:

* The directory containing foo.json has a dot: "my.files"
* The workspace name itself contains a dot: "my.ws"

Can I do the above? If not, why not? What SQL syntax would I use? Maybe:

{code}
SELECT `my.ws`.`/my.files/foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (4)
{code}

Seems we've rather gotten ourselves into a muddle by allowing separator dots in 
names.

About here I guess we should ask, what do Hive and Parquet do? They must have 
solved this issue.



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-13 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086146#comment-16086146
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


Drill handles field names and schema paths in different ways. For schema paths 
Drill uses Calcite. The required schema is returned by the method 
[getSchema()|https://github.com/mapr/incubator-calcite/blob/DrillCalcite1.4.0-mapr-1.4.0/core/src/main/java/org/apache/calcite/prepare/CalciteCatalogReader.java#L157].
 So for the input {{`dfs.data`}} the required schema is returned in the first 
iteration of the loop. For the input {{dfs.data}}, the schema {{dfs}} is 
assigned in the first iteration and the schema {{data}} in the next iteration.
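
The loop described above can be pictured with a toy model. This is only a sketch, not the Calcite code: nested dicts stand in for schemas, and the names {{ROOT}} and {{get_schema}} are made up for illustration. It assumes the quoted name arrives as a single segment and the unquoted name as two segments.

```python
# Toy model only: nested dicts stand in for schemas; this is not Calcite code.
# ROOT and get_schema are hypothetical names used for illustration.
ROOT = {
    "dfs.data": {"kind": "schema registered under the full name dfs.data"},
    "dfs": {"data": {"kind": "sub-schema data reached through dfs"}},
}


def get_schema(root, segments):
    """Descend one level per name segment, one segment per loop iteration."""
    schema = root
    for segment in segments:
        schema = schema[segment]  # raises KeyError if the segment is unknown
    return schema


# `dfs.data` (quoted): one segment, resolved in the first iteration.
quoted = get_schema(ROOT, ["dfs.data"])
# dfs.data (unquoted): two segments, dfs first and then data.
unquoted = get_schema(ROOT, ["dfs", "data"])
```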

Query 
{code:sql}
SELECT `rk.q`, `m.a.b` FROM test_table;
{code}
returns
{noformat}
+-------+--------+
| rk.q  | m.a.b  |
+-------+--------+
| a     | null   |
+-------+--------+
{noformat}
In this case Drill looks for a field named {{m.a.b}}.

Query
{code:sql}
SELECT t.`rk.q`, t.m.`a.b` FROM test_table t;
{code}
returns correct result:
{noformat}
+-------+---------+
| rk.q  | EXPR$1  |
+-------+---------+
| a     | 1       |
+-------+---------+
{noformat}

So yes, a dot inside backticks is considered part of the field name, and a dot 
outside backticks is considered a separator.
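
As a rough illustration of that rule (not Drill's parser), the sketch below splits a column path on dots that are outside backticks and keeps dots that are inside them. The function name {{split_path}} is hypothetical, and escaped backticks inside a quoted name are not handled.

```python
def split_path(path):
    """Split on dots outside backticks; dots inside backticks stay in the name.

    Hypothetical helper for illustration; it does not handle escaped backticks.
    """
    segments, current, quoted = [], [], False
    for ch in path:
        if ch == "`":
            quoted = not quoted  # toggle quoted state and drop the backtick
        elif ch == "." and not quoted:
            segments.append("".join(current))  # separator: close the segment
            current = []
        else:
            current.append(ch)
    if quoted:
        raise ValueError("unterminated backtick in " + repr(path))
    segments.append("".join(current))
    return segments


whole_name = split_path("`m.a.b`")    # one field whose name contains dots
nested = split_path("t.m.`a.b`")      # alias t, map m, field a.b
```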



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-13 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085954#comment-16085954
 ] 

Paul Rogers commented on DRILL-4264:


In the example case, what happens for:

{code}
SELECT `rk.q`, `m.a.b` FROM test_table;
{code}

To work, is it necessary to do this:

{code}
SELECT `rk.q`, m.`a.b` FROM test_table;
{code}

That is, is a dot inside back-ticks considered part of the name, but those 
outside considered separators?

I ask because I often do the following, and it works:

{code}
select `name`, `monisid`, `validitydate` from `dfs.data`.`gen.json`
{code}

That is, in the table name, dots inside backticks are, in fact, separators. So, 
do column and table names have different syntax rules?

Can we spell out the syntax rules for these three cases:

* Column names in the planner
* Table names in the planner
* Column names discovered at runtime

All that said, we should certainly make {{SELECT *}} work as the name expansion 
is done in the execution engine, not the planner.



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-13 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085700#comment-16085700
 ] 

Volodymyr Vysotskyi commented on DRILL-4264:


Currently Drill behaves inconsistently when querying a file whose field names contain dots. 
Query
{code:sql}
select * from test_table t
{code}
fails, but query 
{code:sql}
select `rk.q` as `rk.q` from test_table t
{code}
returns correct result for the file
{noformat}
{"rk.q": "a", "m": {"a.b":"1", "a":{"b":"2"}, "c":"3"}}
{noformat}
The difference between these two cases is that in the second case the field 
reference was created using the method 
{{FieldReference.getWithQuotedRef(field.getName())}}, which does not 
[check|https://github.com/apache/drill/blob/90f43bff7a01eaaee6c8861137759b05367dfcf3/logical/src/main/java/org/apache/drill/common/expression/FieldReference.java#L54]
 the field name. In the first case, the constructor that performs the check was 
[used|https://github.com/apache/drill/blob/416ec70a616e8d12b5c7fca809763b977d2f7aad/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L360].

A nested field may be selected in several ways, for example 
{{t.m.c}} or {{t.m\['c'\]}}.
Without the field name check, the query
{code:sql}
select t.m.`a.b`, t.m.a.b, t.m['a.b'] from test_table t
{code}
returns the correct result.
MySQL, for example, also allows quoted field names that contain dots.
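
To make the difference between these references concrete, here is a small sketch that walks the sample record from the {noformat} block above by a list of already-split name segments. It is an illustration in Python, not Drill code, and the helper name {{resolve}} is hypothetical.

```python
# The sample record from the comment above.
record = {"rk.q": "a", "m": {"a.b": "1", "a": {"b": "2"}, "c": "3"}}


def resolve(rec, segments):
    """Follow the name segments through nested maps; None if a name is missing."""
    value = rec
    for name in segments:
        if not isinstance(value, dict) or name not in value:
            return None
        value = value[name]
    return value


quoted_leaf = resolve(record, ["m", "a.b"])    # like t.m.`a.b` or t.m['a.b']
dotted_path = resolve(record, ["m", "a", "b"]) # like t.m.a.b
single_name = resolve(record, ["m.a.b"])       # like `m.a.b`: no such top-level field
```

The three references reach three different places in the record, which is why the quoting matters.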

The preferred solution is to remove the check for fields with dots. 
However, a user may forget to quote a field with dots, so a query may return a 
result that the user does not expect.

Another solution is to add a session option that allows fields with dots and, 
depending on this option, to check the field or not. The user would then be 
responsible for queries with forgotten quotes.
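
A minimal sketch of the second solution, assuming a boolean option. The names {{make_field_reference}} and {{allow_dots}} are invented for illustration and are not Drill's API or a real Drill option. The error text mirrors the message quoted in the issue description.

```python
def make_field_reference(name, allow_dots=False):
    """Return the field name, rejecting dotted names unless the option is set.

    Hypothetical helper: allow_dots stands in for the proposed session option.
    """
    if "." in name and not allow_dots:
        raise ValueError(
            'Unhandled field reference "%s"; a field reference identifier '
            "must not have the form of a qualified name" % name
        )
    return name


# With the option enabled, the dotted name from the issue is accepted as is.
accepted = make_field_reference("0.0.1", allow_dots=True)
```

With the option off, the same call raises, which matches the current behaviour described in the issue.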



[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2017-07-10 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080129#comment-16080129
 ] 

Arina Ielchiieva commented on DRILL-4264:

Some useful notes provided by [~Paul.Rogers]:

1. Escape mechanism for dots in names. (See this for how it is done in 
[JSON|https://stackoverflow.com/questions/2577172/how-to-get-json-objects-value-if-its-name-contains-dots].
 Can we support table["1.2.3"] syntax in Drill?)

2. Dots in column names: Identify the issue (Drill attaches meaning to the 
dots.) Research SQL escape characters. How do other products handle this? Given 
that these names come from JSON, can we use JavaScript syntax (table["1.2.3"])? 
How do we ensure current behavior does not break? Experiment to see what works.
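
For comparison, this is how bracket access sidesteps the problem outside of SQL. The sketch uses Python's standard {{json}} module on the data from the issue description; it only illustrates the bracket idea and says nothing about how Drill would implement it.

```python
import json

# The JSON document from the issue description.
doc = json.loads("""
{
  "0.0.1": {"version": "0.0.1", "date_created": "2014-03-15"},
  "0.1.2": {"version": "0.1.2", "date_created": "2014-05-21"}
}
""")

# Bracket access takes the key as a string, so its dots are never separators.
first_version = doc["0.0.1"]["version"]
second_created = doc["0.1.2"]["date_created"]
```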





[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

2016-02-16 Thread Zelaine Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149859#comment-15149859
 ] 

Zelaine Fong commented on DRILL-4264:

Although the error is slightly different, DRILL-3922 looks like it might be 
related.

> Dots in identifier are not escaped correctly
> 
>
> Key: DRILL-4264
> URL: https://issues.apache.org/jira/browse/DRILL-4264
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Codegen
>Reporter: Alex
>
> If you have some json data like this...
> {code:javascript}
> {
>   "0.0.1":{
> "version":"0.0.1",
> "date_created":"2014-03-15"
>   },
>   "0.1.2":{
> "version":"0.1.2",
> "date_created":"2014-05-21"
>   }
> }
> {code}
> ... there is no way to select any of the rows since their identifiers contain 
> dots and when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference 
> "0.0.1"; a field reference identifier must not have the form of a qualified 
> name
> This must be fixed since there are many json data files containing dots in 
> some of the keys (e.g. when specifying version numbers etc)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)