[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1138 +1. LGTM. ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1138 @vvysotskyi, thanks for addressing the schema issues! ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user vvysotskyi commented on the issue: https://github.com/apache/drill/pull/1138 @arina-ielchiieva, @paul-rogers, I have reworked this pull request to use `AvroDrillTable`, as it was before my changes. Also, with `AvroDrillTable` there is no need to make changes in `AvroRecordReader`, so I reverted them. Could you please take a look again? ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user vvysotskyi commented on the issue: https://github.com/apache/drill/pull/1138 @paul-rogers, schema is taken from the first file in the `FormatSelection`. Therefore for the case, when we have a table with several files with a different scheme, Drill query will fail. As for the plan-time type information, besides the validation at the stage when a query is converted into rel nodes, field list may be used in project rel nodes instead of the dynamic star for `DynamicDrillTable`. ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1138 Another thought. The removed code is at plan time. Did the original code have to open each file to retrieve schema? If so, does removing the code remove that load? If so, then this change could be a huge performance improvement if avoids the need to open every file in the Foreman. Then, the the next question is: do we actually do anything with the plan-time type information? Few files have that information. Given that, does the planner actually use the information? Is this something we get for free from Calcite? If we are not using the type information at plan time, then clearly there is no harm in removing the code that retrieves the type information. ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1138 As @arina-ielchiieva points out, this change backs out plan-time knowledge of schema. This may not affect run-time accuracy. However, it does mean that queries can be planned, based on not knowing types, that fail at runtime when types are learned. This seems more like a bug that a feature. In general, we should use all information available. It is not helpful to ignore information if doing so results in poorer user experience. ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1138 General comment: if we could move to the new scan framework; it handles implicit columns for all file-based readers. It also handles projection, missing columns, etc... ---
[GitHub] drill issue #1138: DRILL-4120: Allow implicit columns for Avro storage forma...
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1138 You are basically reverting changes done in DRILL-3810 to support schema validation in Avro. Avro format is strict and has schema. Should Drill treat it the same way or do loosen parsing? We should evaluate the option of leaving schema for avro but adding implicit columns. Maybe the change won't be as easy as changing `AvroDrillTable` to `DynamicDrillTable` but it might be more correct. You can also start mailing thread on dev / user list, asking about treating avro as dynamic format (listing pros and cons) and get feedback from the users. [1] https://issues.apache.org/jira/browse/DRILL-3810 ---