[ https://issues.apache.org/jira/browse/DRILL-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Barclay (Drill) updated DRILL-3955:
------------------------------------------
    Description: 
If all of the rows read by a given {{HBaseRecordReader}} have no HBase columns in a given HBase column family, {{HBaseRecordReader}} doesn't create a Drill column for that HBase column family.

Later, in a {{ProjectRecordBatch}}'s {{setupNewSchema}}, because no Drill column exists for that HBase column family, that {{setupNewSchema}} creates a dummy Drill column using the usual {{NullableIntVector}} type. In particular, it is not a map vector, which is what {{HBaseRecordReader}} creates when it sees an HBase column family.

Should {{HBaseRecordReader}} and/or something around setting up for reading HBase (including setting up that {{ProjectRecordBatch}}) make sure that all HBase column families are represented with map vectors so that {{setupNewSchema}} doesn't create a dummy field of type {{NullableIntVector}}?

The problem is that, currently, when an HBase table is read in two separate fragments, one fragment (seeing rows with columns in the column family) can get a map vector for the column family while the other (seeing only rows with no columns in the column family) can get the {{NullableIntVector}}. Downstream code that receives the two batches ends up with an unresolved conflict, yielding IndexOutOfBoundsExceptions as in DRILL-3954.

It's not clear whether there is only one bug or two:
* One bug: downstream code doesn't correctly resolve {{NullableIntVector}} dummy fields (DRILL-TBD). That resolution is applicable to sources other than just HBase.
* Two bugs: the above, and additionally the HBase reading code should set up a Drill column for every HBase column family, regardless of whether it has any columns in the rows that were read.

> Possible bug in creation of Drill columns for HBase column families
> -------------------------------------------------------------------
>
>                 Key: DRILL-3955
>                 URL: https://issues.apache.org/jira/browse/DRILL-3955
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Daniel Barclay (Drill)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
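
For illustration only, the mismatch described in this issue can be modeled with a small standalone Java sketch. It does not use Drill's real classes: {{ColumnType}}, {{readerSchema}}, and {{projectedSchema}} are hypothetical stand-ins for the vector types, for the behavior attributed to {{HBaseRecordReader}}, and for the dummy-column fallback attributed to {{setupNewSchema}}. The {{knownFamilies}} parameter models the possible fix of registering every column family up front, which is an assumption rather than an existing Drill option.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnFamilySchemaSketch {

    /** Hypothetical stand-ins for Drill vector types; not Drill's real classes. */
    enum ColumnType { MAP, NULLABLE_INT }

    /**
     * Models the reader: a column family gets a MAP column only if it is listed in
     * knownFamilies or if at least one row actually has a column in that family.
     * Each row maps family name -> (qualifier -> value).
     */
    static Map<String, ColumnType> readerSchema(
            List<Map<String, Map<String, String>>> rows, List<String> knownFamilies) {
        Map<String, ColumnType> schema = new LinkedHashMap<>();
        for (String family : knownFamilies) {
            schema.put(family, ColumnType.MAP);
        }
        for (Map<String, Map<String, String>> row : rows) {
            for (Map.Entry<String, Map<String, String>> family : row.entrySet()) {
                if (!family.getValue().isEmpty()) {
                    schema.put(family.getKey(), ColumnType.MAP);
                }
            }
        }
        return schema;
    }

    /**
     * Models the projection step: any projected column that the reader did not
     * supply is filled in as a dummy NULLABLE_INT column.
     */
    static Map<String, ColumnType> projectedSchema(
            Map<String, ColumnType> readerSchema, List<String> projected) {
        Map<String, ColumnType> schema = new LinkedHashMap<>();
        for (String name : projected) {
            schema.put(name, readerSchema.getOrDefault(name, ColumnType.NULLABLE_INT));
        }
        return schema;
    }

    public static void main(String[] args) {
        // Fragment A sees a row with columns in both families; fragment B sees none in f2.
        List<Map<String, Map<String, String>>> fragmentA =
                List.of(Map.of("f1", Map.of("c1", "v1"), "f2", Map.of("c2", "v2")));
        List<Map<String, Map<String, String>>> fragmentB =
                List.of(Map.of("f1", Map.of("c1", "v1")));
        List<String> projected = List.of("f1", "f2");

        // Behavior as described in the issue: no families registered up front.
        System.out.println("A: " + projectedSchema(readerSchema(fragmentA, List.of()), projected));
        System.out.println("B: " + projectedSchema(readerSchema(fragmentB, List.of()), projected));

        // Possible fix: register every column family as a map before reading.
        System.out.println("B (all families registered): "
                + projectedSchema(readerSchema(fragmentB, projected), projected));
    }
}
```

In this model the two fragments disagree on the type of {{f2}} (map versus dummy nullable int) unless every family is registered up front, in which case both report a map. Whether the real fix belongs in the reader, in downstream schema resolution, or in both is the open question of this issue.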