[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb closed the pull request at: https://github.com/apache/incubator-hawq/pull/1225 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1225#discussion_r118808410

    --- Diff: pxf/pxf-hive/src/test/java/org/apache/hawq/pxf/plugins/hive/utilities/ProfileFactoryTest.java ---
    @@ -34,31 +34,31 @@ public void get() throws Exception {
         // For TextInputFormat when table has no complex types, HiveText profile should be used
    -    String profileName = ProfileFactory.get(new TextInputFormat(), false);
    +    String profileName = ProfileFactory.get(new TextInputFormat(), false, null);
    --- End diff --

    Sure
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1225#discussion_r118807815

    --- Diff: pxf/pxf-hive/src/test/java/org/apache/hawq/pxf/plugins/hive/utilities/ProfileFactoryTest.java ---
    @@ -34,31 +34,31 @@ public void get() throws Exception {
         // For TextInputFormat when table has no complex types, HiveText profile should be used
    -    String profileName = ProfileFactory.get(new TextInputFormat(), false);
    +    String profileName = ProfileFactory.get(new TextInputFormat(), false, null);
    --- End diff --

    Can these changes be reverted now that the function with 2 arguments is back?
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1225#discussion_r118807905

    --- Diff: pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/BridgeOutputBuilder.java ---
    @@ -137,6 +137,18 @@ public Writable getErrorOutput(Exception ex) throws Exception {
             return outputList;
         }

    +    public LinkedList<Writable> makeVectorizedOutput(List<List<OneField>> recordsBatch) throws BadRecordException {
    +        outputList.clear();
    +        for (List<OneField> record : recordsBatch) {
    --- End diff --

    Are no null checks necessary for recordsBatch and record?
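The null-check question above can be illustrated with a minimal sketch. It is not the PXF implementation: `BatchOutputSketch` is a hypothetical stand-in for the output builder, records are plain `List<Object>` values, and output entries are strings. Skipping a null record (rather than raising an error) is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for an output builder; not the real BridgeOutputBuilder.
final class BatchOutputSketch {
    private final LinkedList<String> outputList = new LinkedList<>();

    /** Converts every non-null record of the batch into one output entry. */
    LinkedList<String> makeVectorizedOutput(List<List<Object>> recordsBatch) {
        outputList.clear();
        if (recordsBatch == null) {
            return outputList; // a missing batch yields an empty output
        }
        for (List<Object> record : recordsBatch) {
            if (record == null) {
                continue; // assumption: skip missing records instead of failing
            }
            outputList.add(String.valueOf(record));
        }
        return outputList;
    }

    public static void main(String[] args) {
        List<List<Object>> batch = new ArrayList<>();
        batch.add(Arrays.<Object>asList(1, "a"));
        batch.add(null);
        System.out.println(new BatchOutputSketch().makeVectorizedOutput(batch));
    }
}
```

Whether a null record should be skipped or reported as a bad record is a policy decision for the real builder; the sketch only shows where the two checks would sit.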
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1225#discussion_r118788222

    --- Diff: pxf/pxf-service/src/main/resources/pxf-profiles-default.xml ---
    @@ -101,6 +101,17 @@ under the License.
                 <outputFormat>org.apache.hawq.pxf.service.io.GPDBWritable</outputFormat>
             </plugins>
         </profile>
    +    <profile>
    +        <name>HiveVectorizedORC</name>
    --- End diff --

    Renamed all classes to use "vectorized"
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1225#discussion_r118766404

    --- Diff: pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/ReadVectorizedBridge.java ---
    @@ -0,0 +1,126 @@
    +package org.apache.hawq.pxf.service;
    --- End diff --

    Makes sense, extended.
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118606404 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); +resolvedBatch.add(row); +for (int j = 0; j < inputData.getColumns(); j++) { +
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118602026 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); --- End diff -- Thanks, updated
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118601254 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchAccessor.java --- @@ -0,0 +1,115 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.*; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.api.utilities.Utilities; +import org.apache.hawq.pxf.plugins.hdfs.utilities.HdfsUtilities; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.ql.io.orc.Reader.Options; +import org.apache.hadoop.hive.ql.io.orc.RecordReader; +import org.apache.hadoop.io.LongWritable; + +/** + * Accessor class which reads data in batches. + * One batch is 1024 rows of all projected columns + * + */ +public class HiveORCBatchAccessor extends Plugin implements ReadAccessor { --- End diff -- Sure, updated
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118601231 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchAccessor.java --- @@ -0,0 +1,115 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.*; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.api.utilities.Utilities; +import org.apache.hawq.pxf.plugins.hdfs.utilities.HdfsUtilities; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.ql.io.orc.Reader.Options; +import org.apache.hadoop.hive.ql.io.orc.RecordReader; +import org.apache.hadoop.io.LongWritable; + +/** + * Accessor class which reads data in batches. + * One batch is 1024 rows of all projected columns + * + */ +public class HiveORCBatchAccessor extends Plugin implements ReadAccessor { + +protected RecordReader vrr; +private int batchIndex; +private VectorizedRowBatch batch; + +public HiveORCBatchAccessor(InputData input) throws Exception { +super(input); +} + +@Override +public boolean openForRead() throws Exception { +Reader reader = HiveUtilities.getOrcReader(inputData); +Options options = new Options(); +addColumns(options); +addFragments(options); +vrr = reader.rowsOptions(options); +return vrr.hasNext(); +} + +/** + * File might have multiple splits, so this method restricts + * reader to one split. + * @param options reader options to modify + */ +private void addFragments(Options options) { +FileSplit fileSplit = HdfsUtilities.parseFileSplit(inputData); +options.range(fileSplit.getStart(), fileSplit.getLength()); +} + +/** + * Reads next batch for current fragment. 
+ * @return next batch in OneRow format, key is a batch number, data is a batch + */ +@Override +public OneRow readNextObject() throws IOException { +if (vrr.hasNext()) { +batch = vrr.nextBatch(batch); +batchIndex++; +return new OneRow(new LongWritable(batchIndex), batch); +} else { +//All batches are exhausted +return null; +} +} + +/** + * This method updated reader optionst to include projected columns only. --- End diff -- Thanks, fixed
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118599822 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchAccessor.java --- @@ -0,0 +1,115 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.*; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.api.utilities.Utilities; +import org.apache.hawq.pxf.plugins.hdfs.utilities.HdfsUtilities; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.ql.io.orc.Reader.Options; +import org.apache.hadoop.hive.ql.io.orc.RecordReader; +import org.apache.hadoop.io.LongWritable; + +/** + * Accessor class which reads data in batches. + * One batch is 1024 rows of all projected columns + * + */ +public class HiveORCBatchAccessor extends Plugin implements ReadAccessor { + +protected RecordReader vrr; +private int batchIndex; +private VectorizedRowBatch batch; + +public HiveORCBatchAccessor(InputData input) throws Exception { +super(input); +} + +@Override +public boolean openForRead() throws Exception { +Reader reader = HiveUtilities.getOrcReader(inputData); +Options options = new Options(); +addColumns(options); +addFragments(options); +vrr = reader.rowsOptions(options); +return vrr.hasNext(); +} + +/** + * File might have multiple splits, so this method restricts + * reader to one split. + * @param options reader options to modify + */ +private void addFragments(Options options) { +FileSplit fileSplit = HdfsUtilities.parseFileSplit(inputData); +options.range(fileSplit.getStart(), fileSplit.getLength()); +} + +/** + * Reads next batch for current fragment. 
+ * @return next batch in OneRow format, key is a batch number, data is a batch + */ +@Override +public OneRow readNextObject() throws IOException { +if (vrr.hasNext()) { +batch = vrr.nextBatch(batch); +batchIndex++; +return new OneRow(new LongWritable(batchIndex), batch); +} else { +//All batches are exhausted +return null; +} +} + +/** + * This method updated reader optionst to include projected columns only. + * @param options reader options to modify + * @throws Exception + */ +private void addColumns(Options options) throws Exception { +boolean[] includeColumns = new boolean[inputData.getColumns() + 1]; --- End diff -- That's how the ORC batch API expects this parameter.
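The `inputData.getColumns() + 1` sizing can be illustrated with a minimal sketch. It assumes a flat schema and assumes that slot 0 of the include array corresponds to the root struct in ORC's column numbering, so that the i-th top-level column lands in slot i + 1. `OrcIncludeMaskSketch` and `buildIncludeMask` are hypothetical names, and no ORC classes are used.

```java
import java.util.Arrays;

// Hypothetical helper; it only mirrors the array layout discussed above.
final class OrcIncludeMaskSketch {

    /**
     * Builds an include mask one element longer than the column count.
     * Slot 0 is left untouched (assumed to stand for the root struct);
     * slot i + 1 carries the projection flag of the i-th top-level column.
     */
    static boolean[] buildIncludeMask(boolean[] projected) {
        boolean[] include = new boolean[projected.length + 1];
        for (int i = 0; i < projected.length; i++) {
            include[i + 1] = projected[i];
        }
        return include;
    }

    public static void main(String[] args) {
        boolean[] mask = buildIncludeMask(new boolean[] {true, false, true});
        System.out.println(Arrays.toString(mask));
    }
}
```

The one-to-one shift only holds while every top-level column is a primitive; nested types would occupy more than one slot, which this sketch does not handle.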
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user shivzone commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1225#discussion_r118358798

    --- Diff: pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/ReadVectorizedBridge.java ---
    @@ -0,0 +1,126 @@
    +package org.apache.hawq.pxf.service;
    --- End diff --

    ReadVectorizedBridge looks very similar to ReadBridge except for the getNext() function. Please refactor both classes to avoid duplication.
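One way to remove the duplication described above is to let the vectorized bridge extend the row bridge and override only `getNext()`. The sketch below is a simplified illustration, not the PXF classes: `ReadBridgeSketch` and `ReadVectorizedBridgeSketch` are hypothetical, the accessor is a plain `Iterator<String>`, and a "batch" is modeled as a comma-separated string.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical row-at-a-time bridge holding the shared lifecycle code.
class ReadBridgeSketch {
    protected final Iterator<String> accessor;

    ReadBridgeSketch(Iterator<String> accessor) {
        this.accessor = accessor;
    }

    /** Shared by both bridges: reports whether there is anything to read. */
    boolean beginIteration() {
        return accessor.hasNext();
    }

    /** Returns the next record, or null when the source is exhausted. */
    String getNext() {
        return accessor.hasNext() ? accessor.next() : null;
    }
}

// Hypothetical vectorized bridge: inherits everything except getNext().
class ReadVectorizedBridgeSketch extends ReadBridgeSketch {
    private final Deque<String> pending = new ArrayDeque<>();

    ReadVectorizedBridgeSketch(Iterator<String> accessor) {
        super(accessor);
    }

    /** Unpacks one batch at a time and hands out its records one by one. */
    @Override
    String getNext() {
        while (pending.isEmpty() && accessor.hasNext()) {
            pending.addAll(Arrays.asList(accessor.next().split(",")));
        }
        return pending.poll(); // null once every batch is drained
    }
}
```

With this shape the vectorized class carries only the batch-draining logic, and any change to the shared lifecycle code is made once in the base class.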
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user shivzone commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118332954 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); +resolvedBatch.add(row); +for (int j = 0; j < inputData.getColumns(); j++) { +row.add(null);
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user shivzone commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118339930 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); +resolvedBatch.add(row); +for (int j = 0; j < inputData.getColumns(); j++) { +row.add(null);
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118135496 --- Diff: pxf/pxf-api/src/main/java/org/apache/hawq/pxf/api/utilities/Utilities.java --- @@ -234,4 +235,15 @@ public static boolean useStats(ReadAccessor accessor, InputData inputData) { return false; } } + +public static boolean useVectorization(InputData inputData) { +boolean isVectorizedResolver = false; +try { +isVectorizedResolver = ArrayUtils.contains(Class.forName(inputData.getResolver()).getInterfaces(), ReadVectorizedResolver.class); +} catch (ClassNotFoundException e) { +LOG.error("Unable to load resolver class: " + e.getMessage()); +return false; --- End diff -- Sure, thanks
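For reference, the interface check in useVectorization can be sketched in isolation. Class.getInterfaces() lists only the interfaces a class declares directly, so a resolver that inherits the interface from a superclass would not be detected by that check, whereas Class.isAssignableFrom covers both cases. The types below (VectorizedResolver, BaseResolver, DerivedResolver, PlainResolver) are stand-ins invented for this sketch and are not PXF classes.

```java
import java.util.Arrays;

final class VectorizationCheckSketch {

    // Hypothetical stand-in for ReadVectorizedResolver.
    interface VectorizedResolver {}

    static class BaseResolver implements VectorizedResolver {}
    static class DerivedResolver extends BaseResolver {}
    static class PlainResolver {}

    // Same idea as the diff: inspects only the directly declared interfaces.
    static boolean declaresDirectly(Class<?> resolver) {
        return Arrays.asList(resolver.getInterfaces()).contains(VectorizedResolver.class);
    }

    // Alternative: true when the interface is implemented directly or through a superclass.
    static boolean isVectorized(Class<?> resolver) {
        return VectorizedResolver.class.isAssignableFrom(resolver);
    }

    // Loads the class by name; a class that cannot be found counts as "not vectorized".
    static boolean useVectorization(String resolverClassName) {
        try {
            return isVectorized(Class.forName(resolverClassName));
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("direct check on subclass: " + declaresDirectly(DerivedResolver.class));
        System.out.println("assignable check on subclass: " + isVectorized(DerivedResolver.class));
    }
}
```

Whether the difference matters here depends on whether any resolver is expected to subclass another vectorized resolver; the thread does not say.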
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118132215 --- Diff: pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/ReadBridge.java --- @@ -149,9 +149,10 @@ public static ReadAccessor getFileAccessor(InputData inputData) inputData.getAccessor(), inputData); } -public static ReadResolver getFieldsResolver(InputData inputData) +@SuppressWarnings("unchecked") --- End diff -- Ouch, can you make Utilities.createAnyInstance templatized instead?
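One way to read the "templatized" suggestion is a generic factory method that takes the expected type as a Class token and checks the created object with Class.cast, so callers need neither an unchecked cast nor @SuppressWarnings. This is only a sketch of that idea: Config, Resolver and SimpleResolver are hypothetical stand-ins, and the signature shown is not the actual Utilities.createAnyInstance signature.

```java
import java.lang.reflect.Constructor;

final class CreateInstanceSketch {

    // Hypothetical stand-in for InputData.
    static class Config {
        final String name;
        Config(String name) { this.name = name; }
    }

    // Hypothetical stand-in for a resolver interface.
    interface Resolver {
        String describe();
    }

    static class SimpleResolver implements Resolver {
        private final Config config;
        public SimpleResolver(Config config) { this.config = config; }
        @Override public String describe() { return "resolver for " + config.name; }
    }

    // Generic factory: the caller names the expected type and Class.cast verifies it at run time.
    static <T> T createAnyInstance(Class<T> expected, String className, Config config) {
        try {
            Class<?> cls = Class.forName(className);
            Constructor<?> ctor = cls.getConstructor(Config.class);
            return expected.cast(ctor.newInstance(config));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Unable to create " + className, e);
        }
    }

    public static void main(String[] args) {
        Resolver r = createAnyInstance(Resolver.class, SimpleResolver.class.getName(), new Config("orc"));
        System.out.println(r.describe());
    }
}
```

A class of the wrong type then fails with a ClassCastException inside the factory rather than at some later use of the returned object.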
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118131278 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); +resolvedBatch.add(row); +for (int j = 0; j < inputData.getColumns(); j++) { +row.add(null);
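The quoted hunk stops right after the result set is pre-filled with nulls. A plausible reason for that allocation, assumed here because the rest of the method is not shown, is that a vectorized batch is stored column by column, so each row slot must exist before the columns are walked. The sketch below shows that column-to-row transposition with plain arrays; BatchTransposeSketch and toRows are hypothetical names and no Hive types are involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BatchTransposeSketch {

    // columns[j][i] holds the value of column j in row i (column-major layout).
    static List<List<Object>> toRows(Object[][] columns, int rowCount) {
        // Allocate an empty result set: one list per row, pre-filled with nulls.
        List<List<Object>> rows = new ArrayList<>(rowCount);
        for (int i = 0; i < rowCount; i++) {
            rows.add(new ArrayList<>(Collections.nCopies(columns.length, (Object) null)));
        }
        // Walk the data column by column and drop each value into its row slot.
        for (int j = 0; j < columns.length; j++) {
            for (int i = 0; i < rowCount; i++) {
                rows.get(i).set(j, columns[j][i]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Object[][] columns = { {1, 2}, {"a", "b"} };
        System.out.println(toRows(columns, 2));
    }
}
```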
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118129347 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveDataFragmenter.java --- @@ -289,7 +289,7 @@ private void fetchMetaData(HiveTablePartition tablePartition, boolean hasComplex if (inputData.getProfile() != null) { // evaluate optimal profile based on file format if profile was explicitly specified in url // if user passed accessor+fragmenter+resolver - use them -profile = ProfileFactory.get(fformat, hasComplexTypes); +profile = ProfileFactory.get(fformat, hasComplexTypes, inputData.getProfile()); --- End diff -- getProfile() is called twice (in the if statement and here); it's better to call it once, store the result, and reuse the variable.
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118129472 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveMetadataFetcher.java --- @@ -136,7 +136,7 @@ public HiveMetadataFetcher(InputData md) { private OutputFormat getOutputFormat(String inputFormat, boolean hasComplexTypes) throws Exception { OutputFormat outputFormat = null; InputFormat fformat = HiveDataFragmenter.makeInputFormat(inputFormat, jobConf); -String profile = ProfileFactory.get(fformat, hasComplexTypes); +String profile = ProfileFactory.get(fformat, hasComplexTypes, null); --- End diff -- Passing explicit null params should be avoided if possible; overload the function if more or fewer params are desired.
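The overload suggested above can be sketched as a two-argument method that delegates to the three-argument one, so call sites never pass a literal null. The profile names appear in this thread or in PXF, but the selection rules and the use of a plain String for the input format are simplifications assumed for this sketch; they are not the real ProfileFactory logic.

```java
final class ProfileFactorySketch {

    static final String HIVE = "Hive";
    static final String HIVE_TEXT = "HiveText";
    static final String HIVE_ORC = "HiveORC";
    static final String HIVE_VECTORIZED_ORC = "HiveVectorizedORC";

    // Two-argument overload for callers that have no user-specified profile.
    static String get(String inputFormat, boolean hasComplexTypes) {
        return get(inputFormat, hasComplexTypes, null);
    }

    // Three-argument version; userProfile may be null (assumed selection rules).
    static String get(String inputFormat, boolean hasComplexTypes, String userProfile) {
        if ("orc".equals(inputFormat)) {
            // Keep the vectorized profile only when the user asked for it explicitly.
            return HIVE_VECTORIZED_ORC.equals(userProfile) ? HIVE_VECTORIZED_ORC : HIVE_ORC;
        }
        if ("text".equals(inputFormat) && !hasComplexTypes) {
            return HIVE_TEXT;
        }
        return HIVE;
    }

    public static void main(String[] args) {
        System.out.println(get("text", false));
        System.out.println(get("orc", false, HIVE_VECTORIZED_ORC));
    }
}
```

With this shape the null lives in exactly one place, inside the factory, and existing two-argument callers and tests stay unchanged.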
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118129835 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchAccessor.java --- @@ -0,0 +1,115 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.*; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.api.utilities.Utilities; +import org.apache.hawq.pxf.plugins.hdfs.utilities.HdfsUtilities; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.ql.io.orc.Reader.Options; +import org.apache.hadoop.hive.ql.io.orc.RecordReader; +import org.apache.hadoop.io.LongWritable; + +/** + * Accessor class which reads data in batches. + * One batch is 1024 rows of all projected columns + * + */ +public class HiveORCBatchAccessor extends Plugin implements ReadAccessor { + +protected RecordReader vrr; --- End diff -- Why protected? Is any child class using it?
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118129724 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchAccessor.java --- @@ -0,0 +1,115 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.*; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.api.utilities.Utilities; +import org.apache.hawq.pxf.plugins.hdfs.utilities.HdfsUtilities; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.ql.io.orc.Reader.Options; +import org.apache.hadoop.hive.ql.io.orc.RecordReader; +import org.apache.hadoop.io.LongWritable; + +/** + * Accessor class which reads data in batches. + * One batch is 1024 rows of all projected columns + * + */ +public class HiveORCBatchAccessor extends Plugin implements ReadAccessor { --- End diff -- Would it be useful if it extended HiveORCAccessor and overrode its functions?
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118132761 --- Diff: pxf/pxf-service/src/main/resources/pxf-profiles-default.xml --- @@ -101,6 +101,17 @@ under the License. org.apache.hawq.pxf.service.io.GPDBWritable + +HiveVectorizedORC --- End diff -- Seems like "batch" and "vectorized" are used interchangeably; should we use just one term?
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118131080 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); +resolvedBatch.add(row); +for (int j = 0; j < inputData.getColumns(); j++) { +row.add(null);
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118132449 --- Diff: pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/ReadVectorizedBridge.java --- @@ -0,0 +1,126 @@ +package org.apache.hawq.pxf.service; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.DataInputStream; +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; + +import org.apache.hawq.pxf.api.BadRecordException; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.service.io.Writable; +import org.apache.hawq.pxf.service.utilities.ProtocolData; + +public class ReadVectorizedBridge implements Bridge { + +ReadAccessor fileAccessor = null; +ReadVectorizedResolver fieldsResolver; +BridgeOutputBuilder outputBuilder = null; +LinkedList<Writable> outputQueue = null; + +public ReadVectorizedBridge(ProtocolData protData) throws Exception { +outputBuilder = new BridgeOutputBuilder(protData); +outputQueue = new LinkedList<Writable>(); +fileAccessor = ReadBridge.getFileAccessor(protData); +fieldsResolver = ReadBridge.getFieldsResolver(protData); +} + +@Override +public Writable getNext() throws Exception { +Writable output = null; +OneRow batch = null; + +if (!outputQueue.isEmpty()) { +return outputQueue.pop(); +} + +try { +while (outputQueue.isEmpty()) { +batch = fileAccessor.readNextObject(); +if (batch == null) { +output = outputBuilder.getPartialLine(); +if (output != null) { +//LOG.warn("A partial record in the end of the fragment"); --- End diff -- Remove the commented-out lines?
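The control flow of getNext() in the hunk above follows a common pattern: serve rows from a queue, and refill the queue from the next batch only when it runs dry. A minimal sketch of that pattern follows, using strings in place of Writable rows and an iterator in place of the accessor; BatchBridgeSketch and its members are hypothetical names, and the partial-line handling from the diff is left out.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class BatchBridgeSketch {

    // Stand-in for the accessor: each element is one already-resolved batch of rows.
    private final Iterator<List<String>> batches;
    // Rows that were resolved from a batch but not yet handed out.
    private final Deque<String> outputQueue = new ArrayDeque<>();

    BatchBridgeSketch(List<List<String>> batches) {
        this.batches = batches.iterator();
    }

    // Returns the next row, or null once every batch is exhausted.
    String getNext() {
        while (outputQueue.isEmpty()) {
            if (!batches.hasNext()) {
                return null;
            }
            // Refill the queue; an empty batch simply triggers another iteration.
            outputQueue.addAll(batches.next());
        }
        return outputQueue.pop();
    }

    public static void main(String[] args) {
        BatchBridgeSketch bridge = new BatchBridgeSketch(
                Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c")));
        for (String row = bridge.getNext(); row != null; row = bridge.getNext()) {
            System.out.println(row);
        }
    }
}
```

The queue keeps the caller-facing contract at one row per call even though the underlying reads happen a batch at a time.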
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118129564 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchAccessor.java --- @@ -0,0 +1,115 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.*; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadAccessor; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.api.utilities.Utilities; +import org.apache.hawq.pxf.plugins.hdfs.utilities.HdfsUtilities; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.ql.io.orc.Reader.Options; +import org.apache.hadoop.hive.ql.io.orc.RecordReader; +import org.apache.hadoop.io.LongWritable; + +/** + * Accessor class which reads data in batches. + * One batch is 1024 rows of all projected columns + * + */ +public class HiveORCBatchAccessor extends Plugin implements ReadAccessor { + +protected RecordReader vrr; +private int batchIndex; +private VectorizedRowBatch batch; + +public HiveORCBatchAccessor(InputData input) throws Exception { +super(input); +} + +@Override +public boolean openForRead() throws Exception { +Reader reader = HiveUtilities.getOrcReader(inputData); +Options options = new Options(); +addColumns(options); +addFragments(options); +vrr = reader.rowsOptions(options); +return vrr.hasNext(); +} + +/** + * File might have multiple splits, so this method restricts + * reader to one split. + * @param options reader options to modify + */ +private void addFragments(Options options) { +FileSplit fileSplit = HdfsUtilities.parseFileSplit(inputData); +options.range(fileSplit.getStart(), fileSplit.getLength()); +} + +/** + * Reads next batch for current fragment. 
+ * @return next batch in OneRow format, key is a batch number, data is a batch + */ +@Override +public OneRow readNextObject() throws IOException { +if (vrr.hasNext()) { +batch = vrr.nextBatch(batch); +batchIndex++; +return new OneRow(new LongWritable(batchIndex), batch); +} else { +//All batches are exhausted +return null; +} +} + +/** + * This method updated reader optionst to include projected columns only. --- End diff -- typo "optionst"
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118131006 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); +resolvedBatch.add(row); +for (int j = 0; j < inputData.getColumns(); j++) { +row.add(null);
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
Github user denalex commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1225#discussion_r118130590 --- Diff: pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveORCBatchResolver.java --- @@ -0,0 +1,257 @@ +package org.apache.hawq.pxf.plugins.hive; + +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +import static org.apache.hawq.pxf.api.io.DataType.BIGINT; +import static org.apache.hawq.pxf.api.io.DataType.BOOLEAN; +import static org.apache.hawq.pxf.api.io.DataType.BPCHAR; +import static org.apache.hawq.pxf.api.io.DataType.BYTEA; +import static org.apache.hawq.pxf.api.io.DataType.DATE; +import static org.apache.hawq.pxf.api.io.DataType.FLOAT8; +import static org.apache.hawq.pxf.api.io.DataType.INTEGER; +import static org.apache.hawq.pxf.api.io.DataType.NUMERIC; +import static org.apache.hawq.pxf.api.io.DataType.REAL; +import static org.apache.hawq.pxf.api.io.DataType.SMALLINT; +import static org.apache.hawq.pxf.api.io.DataType.TEXT; +import static org.apache.hawq.pxf.api.io.DataType.TIMESTAMP; +import static org.apache.hawq.pxf.api.io.DataType.VARCHAR; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Calendar; +import java.util.List; +import java.sql.Timestamp; +import java.sql.Date; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.common.type.HiveDecimal; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.DoubleWritable; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.Text; +import org.apache.hawq.pxf.api.OneField; +import org.apache.hawq.pxf.api.OneRow; +import org.apache.hawq.pxf.api.ReadVectorizedResolver; +import org.apache.hawq.pxf.api.UnsupportedTypeException; +import org.apache.hawq.pxf.api.io.DataType; +import org.apache.hawq.pxf.api.utilities.ColumnDescriptor; +import org.apache.hawq.pxf.api.utilities.InputData; +import org.apache.hawq.pxf.api.utilities.Plugin; +import org.apache.hawq.pxf.plugins.hive.utilities.HiveUtilities; +import org.apache.hadoop.hive.serde2.*; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.*; +import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory; +import org.apache.hadoop.hive.ql.exec.vector.*; + +@SuppressWarnings("deprecation") +public class HiveORCBatchResolver extends Plugin implements ReadVectorizedResolver { + +private static final Log LOG = LogFactory.getLog(HiveORCBatchResolver.class); + +private List<List<OneField>> resolvedBatch; +private StructObjectInspector soi; + +public HiveORCBatchResolver(InputData input) throws Exception { +super(input); +try { +soi = (StructObjectInspector) HiveUtilities.getOrcReader(input).getObjectInspector(); +} catch (Exception e) { +LOG.error("Unable to create an object inspector."); +throw e; +} +} + +@Override +public List<List<OneField>> getFieldsForBatch(OneRow batch) { + +Writable writableObject = null; +Object fieldValue = null; +VectorizedRowBatch vectorizedBatch = (VectorizedRowBatch) batch.getData(); + +// Allocate empty result set +resolvedBatch = new ArrayList<List<OneField>>(vectorizedBatch.size); +for (int i = 0; i < vectorizedBatch.size; i++) { +ArrayList<OneField> row = new ArrayList<OneField>(inputData.getColumns()); --- End diff -- Call inputData.getColumns() once, outside the for loop, if the value it returns is always the same.
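The review suggestion above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code that was merged: `BatchAllocationSketch`, `Field`, and `allocate` are hypothetical names, with `Field` standing in for `OneField` and `columnCount` standing in for the result of a single `inputData.getColumns()` call made before the loops. It assumes the column count does not change while a batch is being resolved.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchAllocationSketch {

    // Hypothetical stand-in for org.apache.hawq.pxf.api.OneField.
    static class Field {
        final int type;
        final Object val;

        Field(int type, Object val) {
            this.type = type;
            this.val = val;
        }
    }

    // Pre-allocates batchSize rows, each holding columnCount null placeholders.
    // columnCount is read once by the caller instead of being re-queried
    // on every iteration of the outer and inner loops.
    static List<List<Field>> allocate(int batchSize, int columnCount) {
        List<List<Field>> resolved = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            List<Field> row = new ArrayList<>(columnCount);
            for (int j = 0; j < columnCount; j++) {
                row.add(null);
            }
            resolved.add(row);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // In the resolver this would be: int columnCount = inputData.getColumns();
        int columnCount = 2;
        List<List<Field>> batch = allocate(3, columnCount);
        System.out.println(batch.size() + " rows x " + batch.get(0).size() + " columns");
    }
}
```

Hoisting the call mostly helps readability; whether it saves measurable time depends on how cheap `getColumns()` is, which the thread does not state.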
[GitHub] incubator-hawq pull request #1225: HAWQ-1446: Introduce vectorized profile f...
GitHub user sansanichfb opened a pull request: https://github.com/apache/incubator-hawq/pull/1225 HAWQ-1446: Introduce vectorized profile for ORC. Work is still in progress; I want to get early feedback. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sansanichfb/incubator-hawq HAWQ-1446 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hawq/pull/1225.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1225 commit 9fb7929120910163e30043b4fd2ebd000f869b4c Author: Oleksandr Diachenko Date: 2017-04-18T21:38:45Z [#143733171] Added vectorized accessor and new profile. commit b65e0e25f6a0520af9fc84ffe71d340c3c896948 Author: Oleksandr Diachenko Date: 2017-04-21T08:27:05Z [#143192433] Added batch resolver.