[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-03-07 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1138
  
+1. LGTM.


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-03-06 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1138
  
@vvysotskyi, thanks for addressing the schema issues! 


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-03-06 Thread vvysotskyi
Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/1138
  
@arina-ielchiieva, @paul-rogers, I have reworked this pull request to use 
`AvroDrillTable`, as it was before my changes. Also, with `AvroDrillTable` 
there is no need to make changes in `AvroRecordReader`, so I reverted them.
Could you please take a look again?


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-03-01 Thread vvysotskyi
Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/1138
  
@paul-rogers, schema is taken from the first file in the `FormatSelection`. 
Therefore for the case, when we have a table with several files with a 
different scheme, Drill query will fail.

As for the plan-time type information, besides the validation at the stage 
when a query is converted into rel nodes, field list may be used in project rel 
nodes instead of the dynamic star for `DynamicDrillTable`.


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-03-01 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1138
  
Another thought. The removed code is at plan time. Did the original code 
have to open each file to retrieve schema? If so, does removing the code remove 
that load? If so, then this change could be a huge performance improvement if 
avoids the need to open every file in the Foreman.

Then, the the next question is: do we actually do anything with the 
plan-time type information? Few files have that information. Given that, does 
the planner actually use the information? Is this something we get for free 
from Calcite? If we are not using the type information at plan time, then 
clearly there is no harm in removing the code that retrieves the type 
information.


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-03-01 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1138
  
As @arina-ielchiieva points out, this change backs out plan-time knowledge 
of schema. This may not affect run-time accuracy. However, it does mean that 
queries can be planned, based on not knowing types, that fail at runtime when 
types are learned. This seems more like a bug that a feature. In general, we 
should use all information available. It is not helpful to ignore information 
if doing so results in poorer user experience.


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-02-28 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1138
  
General comment: if we could move to the new scan framework; it handles 
implicit columns for all file-based readers. It also handles projection, 
missing columns, etc...


---


[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...

2018-02-28 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1138
  
You are basically reverting changes done in DRILL-3810 to support schema 
validation in Avro. 
Avro format is strict and has schema. Should Drill treat it the same way or 
do loosen parsing?

We should evaluate the option of leaving schema for avro but adding 
implicit columns. Maybe the change won't be as easy as changing 
`AvroDrillTable` to `DynamicDrillTable` but it might be more correct.

You can also start mailing thread on dev / user list, asking about treating 
avro as dynamic format (listing pros and cons) and get feedback from the users. 

[1] https://issues.apache.org/jira/browse/DRILL-3810


---