[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15438771

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala ---

@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.sql.hive.test.TestHive
+import org.scalatest.{BeforeAndAfterAll, FunSuite}
+
+class HiveTableScanSuite extends HiveComparisonTest {
+  // MINOR HACK: You must run a query before calling reset the first time.
+  TestHive.hql("SHOW TABLES")
+  TestHive.reset()
+
+  TestHive.hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (ds STRING)
+                 | ROW FORMAT SERDE
+                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
+                 | STORED AS RCFILE
+               """.stripMargin)
+  TestHive.hql("""from src
+                 | insert into table part_scan_test PARTITION (ds='2010-01-01')
+                 | select 100,100 limit 1
+               """.stripMargin)
+  TestHive.hql("""ALTER TABLE part_scan_test set SERDE
+                 | 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
+               """.stripMargin)
+  TestHive.hql("""from src insert into table part_scan_test PARTITION (ds='2010-01-02')
+                 | select 200,200 limit 1
+               """.stripMargin)

--- End diff --

nit: let's make all SQL keywords capital here.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15435898

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

@@ -241,4 +251,37 @@ private[hive] object HadoopTableReader {
     val bufferSize = System.getProperty("spark.buffer.size", "65536")
     jobConf.set("io.file.buffer.size", bufferSize)
   }
+
+  /**
+   * Transform the raw data (Writable objects) of an iterable input into Row objects.
+   * @param iter iterable input, represented as Writable objects
+   * @param deserializer the Deserializer associated with the input Writable objects
+   * @param attrs the row attribute names and their zero-based positions in the MutableRow
+   * @param row reusable MutableRow object
+   *
+   * @return iterable Row objects transformed from the given iterable input
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer,
+      attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] = {

--- End diff --

```scala
def fillObject(
    iter: Iterator[Writable],
    deserializer: Deserializer,
    attrs: Seq[(Attribute, Int)],
    row: GenericMutableRow): Iterator[Row] = {
```
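Outside the Hive-specific types, the shape of `fillObject` can be pictured with a small self-contained sketch. `Attr`, `MutableRow`, and the map-based raw records below are illustrative stand-ins for `Attribute`, `GenericMutableRow`, and deserialized `Writable`s, not the PR's actual code; the point is the single reusable mutable row filled by ordinal for each record:

```scala
object FillObjectSketch {
  // Stand-ins for Catalyst/Hive types; these names are illustrative only.
  case class Attr(name: String)

  final class MutableRow(size: Int) {
    private val values = new Array[Any](size)
    def update(ordinal: Int, v: Any): Unit = values(ordinal) = v
    // Copy out the current contents; the real code hands back the reused Row itself.
    def snapshot: Seq[Any] = values.toSeq
  }

  // Each Map plays the role of one deserialized Writable record.
  def fillObject(
      iter: Iterator[Map[String, Any]],
      attrs: Seq[(Attr, Int)],
      row: MutableRow): Iterator[Seq[Any]] = {
    iter.map { raw =>
      var i = 0
      // A while loop (as in the PR) avoids per-record closure allocation.
      while (i < attrs.length) {
        val (attr, ordinal) = attrs(i)
        row(ordinal) = raw.getOrElse(attr.name, null)
        i += 1
      }
      row.snapshot
    }
  }
}
```

The key design point mirrored here is that one mutable row is allocated per partition and reused for every record, rather than allocating a fresh row per record.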
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-50248847

Also, can you delete `[WIP]` from the PR title?
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49883931

@yhuai @concretevitamin @marmbrus @liancheng Can you take a look at this? I think the test result gives us more confidence in the improvement.
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49574159

Thank you guys, I've updated the code as suggested, and also attached the micro-benchmark result in the PR description.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49577438

QA results for PR 1439:
- This patch FAILED unit tests.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16896/consoleFull
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49578517

QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16897/consoleFull
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49573309

QA tests have started for PR 1439. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16896/consoleFull
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49573615

QA tests have started for PR 1439. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16897/consoleFull
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49501954

As for benchmarks, the micro-benchmark code that comes with #758 may be helpful. And I feel that partitioning support for Parquet should be considered together with the refactoring @yhuai suggested.
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49483743

I'll just add that the `HiveTableReader` vs `HiveTableScan` separation is purely artificial, and the split is based on what code was stolen from Shark vs what code was written for Spark SQL. It would be reasonable to combine them at some point. However, for this PR it would be great to just fix the bug at hand. If we are going to do major refactoring, I'd want to see benchmarks showing that we aren't introducing any performance regressions. It would also be nice to see a test case that currently fails but passes after this PR is added.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49265512

QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16766/consoleFull
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15067812

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala ---

@@ -67,95 +61,12 @@ case class HiveTableScan(
   }

   @transient
-  private[this] val hadoopReader = new HadoopTableReader(relation.tableDesc, context)
-
-  /**
-   * The hive object inspector for this table, which can be used to extract values from the
-   * serialized row representation.
-   */
-  @transient
-  private[this] lazy val objectInspector =
-    relation.tableDesc.getDeserializer.getObjectInspector.asInstanceOf[StructObjectInspector]
-
-  /**
-   * Functions that extract the requested attributes from the hive output. Partitioned values are
-   * casted from string to its declared data type.
-   */
-  @transient
-  protected lazy val attributeFunctions: Seq[(Any, Array[String]) => Any] = {
-    attributes.map { a =>
-      val ordinal = relation.partitionKeys.indexOf(a)
-      if (ordinal >= 0) {
-        val dataType = relation.partitionKeys(ordinal).dataType
-        (_: Any, partitionKeys: Array[String]) => {
-          castFromString(partitionKeys(ordinal), dataType)
-        }
-      } else {
-        val ref = objectInspector.getAllStructFieldRefs
-          .find(_.getFieldName == a.name)
-          .getOrElse(sys.error(s"Can't find attribute $a"))
-        val fieldObjectInspector = ref.getFieldObjectInspector
-
-        val unwrapHiveData = fieldObjectInspector match {
-          case _: HiveVarcharObjectInspector =>
-            (value: Any) => value.asInstanceOf[HiveVarchar].getValue
-          case _: HiveDecimalObjectInspector =>
-            (value: Any) => BigDecimal(value.asInstanceOf[HiveDecimal].bigDecimalValue())
-          case _ =>
-            identity[Any] _
-        }
-
-        (row: Any, _: Array[String]) => {
-          val data = objectInspector.getStructFieldData(row, ref)
-          val hiveData = unwrapData(data, fieldObjectInspector)
-          if (hiveData != null) unwrapHiveData(hiveData) else null
-        }
-      }
-    }
-  }
+  private[this] val hadoopReader = new HadoopTableReader(attributes, relation, context)

   private[this] def castFromString(value: String, dataType: DataType) = {
     Cast(Literal(value), dataType).eval(null)
   }

-  private def addColumnMetadataToConf(hiveConf: HiveConf) {

--- End diff --

I would keep it. It is important to set the needed columns in the conf, so RCFile and ORC can know which columns should be skipped. Also, it seems `hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypeNames)` and `hiveConf.set(serdeConstants.LIST_COLUMNS, columnInternalNames)` will be used to push down filters.
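The point about `addColumnMetadataToConf` can be illustrated abstractly: columnar formats read only the columns the scan declares in the job conf, so dropping that step silently disables column pruning. The conf key and map-based "reader" below are toy stand-ins, not Hive's real `JobConf`/`serdeConstants` API:

```scala
object ColumnPruningSketch {
  // Toy stand-in for a Hadoop JobConf and for keys like serdeConstants.LIST_COLUMNS;
  // "columns.needed" is an invented key used only for this sketch.
  type Conf = Map[String, String]

  def addColumnMetadataToConf(conf: Conf, neededColumns: Seq[String]): Conf =
    conf + ("columns.needed" -> neededColumns.mkString(","))

  // A "columnar reader": like RCFile/ORC, it materializes only the columns
  // declared in the conf, and reads every column when no metadata was set.
  def readRow(conf: Conf, storedRow: Map[String, String]): Map[String, String] =
    conf.get("columns.needed") match {
      case Some(cols) =>
        val wanted = cols.split(",").toSet
        storedRow.filter { case (name, _) => wanted(name) }
      case None => storedRow
    }
}
```

With the metadata set, the reader skips unrequested columns; without it, every column is materialized, which is the regression the reviewer is warning against.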
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15068484

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

@@ -156,33 +158,43 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
     }

     // Create local references so that the outer object isn't serialized.
-    val tableDesc = _tableDesc
+    val tableDesc = relation.tableDesc
     val broadcastedHiveConf = _broadcastedHiveConf
     val localDeserializer = partDeserializer
+    val mutableRow = new GenericMutableRow(attributes.length)
+
+    // split the attributes (output schema) into 2 categories:
+    // (partition keys, ordinal), (normal attributes, ordinal), the ordinal mean the
+    // position of the in the output Row.

--- End diff --

position of the attribute?
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49338675

I think we are not clear on the boundary between a `TableReader` and a physical `TableScan` operator (e.g. `HiveTableScan`). It seems we want `TableReader` to create `RDD`s (general-purpose work) and, inside a `TableScan` operator, to create Catalyst `Row`s (table-specific work). However, when we look at `HadoopTableReader`, it is actually a `HiveTableReader`: for every Hive partition, we create a `HadoopRDD` (requiring Hive-specific code) and deserialize Hive rows. I am not sure `TableReader` is a good abstraction. I think it makes sense to remove the `TableReader` trait and add an abstract `TableScan` class (inheriting `LeafNode`); all existing table scan operators would inherit this abstract `TableScan` class. If we think it is the right approach, I can do it in another PR. @marmbrus, @liancheng, @rxin, @chenghao-intel thoughts?
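As a rough sketch of the proposed shape (invented names, not Catalyst's real `LeafNode`/`SparkPlan` API): the shared, general-purpose work, such as projecting source records onto the output schema, lives in the abstract class, while each subclass keeps only its source-specific record production:

```scala
object TableScanSketch {
  // Illustrative stand-ins; not Catalyst's real classes.
  abstract class TableScan {
    def output: Seq[String]                                // output attribute names
    protected def buildScan(): Iterator[Map[String, Any]]  // source-specific work

    // Shared, general-purpose work: turn source records into "rows"
    // following the output schema.
    final def execute(): Iterator[Seq[Any]] =
      buildScan().map(record => output.map(name => record.getOrElse(name, null)))
  }

  final class HiveTableScan(
      val output: Seq[String],
      data: Seq[Map[String, Any]]) extends TableScan {
    protected def buildScan() = data.iterator
  }

  final class ParquetTableScan(
      val output: Seq[String],
      data: Seq[Map[String, Any]]) extends TableScan {
    protected def buildScan() = data.iterator
  }
}
```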
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49341477

@chenghao-intel explained the root cause in https://issues.apache.org/jira/browse/SPARK-2523. Basically, we should use partition-specific `ObjectInspector`s to extract fields instead of using the `ObjectInspector` set in the `TableDesc`. If partitions of a table use different SerDes, their `ObjectInspector`s will be different. @chenghao-intel can you add unit tests? Is there any Hive query test that can be included?
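The failure mode can be modeled without Hive at all. Below, two toy "SerDes" (illustrative stand-ins, not Hive classes) encode integers differently across partitions; decoding every partition with the table-level SerDe silently yields wrong values, while a per-partition SerDe does not:

```scala
object PartitionSerDeSketch {
  // Toy stand-ins for Hive's Deserializer/ObjectInspector pairing.
  trait SerDe { def deserialize(raw: String): Int }
  object DecimalSerDe extends SerDe { def deserialize(raw: String): Int = raw.toInt }
  object HexSerDe extends SerDe { def deserialize(raw: String): Int = Integer.parseInt(raw, 16) }

  final case class Partition(serde: SerDe, rawRows: Seq[String])

  // Buggy shape: one table-level SerDe applied to every partition.
  def scanWithTableSerDe(tableSerDe: SerDe, parts: Seq[Partition]): Seq[Int] =
    parts.flatMap(p => p.rawRows.map(tableSerDe.deserialize))

  // Fixed shape: each partition is decoded with its own SerDe.
  def scanWithPartitionSerDe(parts: Seq[Partition]): Seq[Int] =
    parts.flatMap(p => p.rawRows.map(p.serde.deserialize))
}
```

In the real fix, the analogous move is to obtain the deserializer (and its `ObjectInspector`) from each partition rather than from the table's `TableDesc`.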
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15092652

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala --- (quoting the same hunk as above, ending with:)

-  private def addColumnMetadataToConf(hiveConf: HiveConf) {

--- End diff --

Oh, yes, I didn't realize that. I will revert it.
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15092708

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- (quoting the same hunk as above, ending with:)

+    // split the attributes (output schema) into 2 categories:
+    // (partition keys, ordinal), (normal attributes, ordinal), the ordinal mean the
+    // position of the in the output Row.

--- End diff --

I should write better documentation here; actually, `Row` is a sub-interface of `Seq`.
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49386103

@yhuai I agree with you that we should make a clear boundary between `HiveTableScan` and `TableReader`, but I am not sure it's a good idea to create multiple `HiveTableScan` classes instead of one. Routing to different table scan operators may require exposing to the `SparkPlanner` more details that currently sit inside `HiveTableScan`. Perhaps making multiple `TableReader`s is more reasonable, for example `TableReader`, `PartitionReader`, `MemoryTableReader`, etc.
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49386783

@chenghao-intel I did not mean to introduce multiple `HiveTableScan`s. I meant to have an abstract `TableScan` and make the existing ones (e.g. `HiveTableScan` and `ParquetTableScan`) subclasses of the abstract `TableScan`.
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49387608

@yhuai sorry if I misunderstood. Do you mean that `HiveTableScan` and `ParquetTableScan` are the new operators created by the `SparkPlanner`, right?
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49388234

@yhuai I got your meaning eventually. I think you're right; some of the logic could be shared among the `TableScan` operators.
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49138992

`ObjectInspector` is not required by `Row` in Catalyst any more (unlike in Shark), and it is tightly coupled with the `Deserializer` for the raw data, so I moved the `ObjectInspector` into `TableReader`.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49147325

QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16724/consoleFull
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49175346

Could you elaborate on when we will see an exception?
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15013611

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

@@ -241,4 +252,37 @@ private[hive] object HadoopTableReader {
     val bufferSize = System.getProperty("spark.buffer.size", "65536")
     jobConf.set("io.file.buffer.size", bufferSize)
   }
+
+  /**
+   * Transform the raw data (Writable objects) of an iterable input into Row objects.
+   * @param iter iterable input, represented as Writable objects
+   * @param deserializer the Deserializer associated with the input Writable objects
+   * @param attrs the row attribute names and their zero-based positions in the MutableRow
+   * @param row reusable MutableRow object
+   *
+   * @return iterable Row objects transformed from the given iterable input
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer,
+      attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] = {
+    val soi = deserializer.getObjectInspector().asInstanceOf[StructObjectInspector]
+    // get the field references according to the attributes (output of the reader) required
+    val fieldRefs = attrs.map { case (attr, idx) => (soi.getStructFieldRef(attr.name), idx) }
+
+    // Map each tuple to a row object
+    iter.map { value =>
+      val raw = deserializer.deserialize(value)
+      var idx = 0;
+      while(idx < fieldRefs.length) {

--- End diff --

nit: space after `while`
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49258632

@yhuai @concretevitamin thanks for the comments. I've updated the description in Jira; can you please jump there and take a look?
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49258678

Sorry, forgot to paste the link: https://issues.apache.org/jira/browse/SPARK-2523
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1439#issuecomment-49259426

QA tests have started for PR 1439. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16766/consoleFull