[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-27 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15438771
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.sql.hive.test.TestHive
+import org.scalatest.{BeforeAndAfterAll, FunSuite}
+
+class HiveTableScanSuite extends HiveComparisonTest {
+  // MINOR HACK: You must run a query before calling reset the first time.
+  TestHive.hql("SHOW TABLES")
+  TestHive.reset()
+
+  TestHive.hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (ds STRING)
+                 | ROW FORMAT SERDE
+                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
+                 | STORED AS RCFILE
+               """.stripMargin)
+  TestHive.hql("""from src
+                 | insert into table part_scan_test PARTITION (ds='2010-01-01')
+                 | select 100,100 limit 1
+               """.stripMargin)
+  TestHive.hql("""ALTER TABLE part_scan_test set SERDE
+                 | 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
+               """.stripMargin)
+  TestHive.hql("""from src insert into table part_scan_test PARTITION (ds='2010-01-02')
+                 | select 200,200 limit 1
+               """.stripMargin)
--- End diff --

nit: let's make all SQL keywords capital here.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-26 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15435898
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -241,4 +251,37 @@ private[hive] object HadoopTableReader {
 val bufferSize = System.getProperty("spark.buffer.size", "65536")
 jobConf.set("io.file.buffer.size", bufferSize)
   }
+
+  /**
+   * Transform the raw data (Writable object) into the Row object for an iterable input
+   * @param iter Iterable input which represented as Writable object
+   * @param deserializer Deserializer associated with the input writable object
+   * @param attrs Represents the row attribute names and its zero-based position in the MutableRow
+   * @param row reusable MutableRow object
+   *
+   * @return Iterable Row object that transformed from the given iterable input.
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer,
+      attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] = {
--- End diff --

```scala
def fillObject(
    iter: Iterator[Writable],
    deserializer: Deserializer,
    attrs: Seq[(Attribute, Int)],
    row: GenericMutableRow): Iterator[Row] = {
```




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-26 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-50248847
  
Also, can you delete `[WIP]` from the PR title?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-23 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49883931
  
@yhuai @concretevitamin @marmbrus @liancheng Can you take a look at this? 
I think the test results give us more confidence in the improvement.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-21 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49574159
  
Thank you guys, I've updated the code as suggested and also attached the 
micro-benchmark results in the PR description.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49577438
  
QA results for PR 1439:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16896/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49578517
  
QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16897/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49573309
  
QA tests have started for PR 1439. This patch DID NOT merge cleanly!
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16896/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49573615
  
QA tests have started for PR 1439. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16897/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-19 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49501954
  
As for benchmarks, the micro-benchmark code that comes with #758 may be helpful. 
And I feel that partitioning support for Parquet should be considered together 
with the refactoring @yhuai suggested.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-18 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49483743
  
I'll just add that the `HiveTableReader` vs `HiveTableScan` separation is 
purely artificial: the split is based on what code was stolen from Shark vs 
what code was written for Spark SQL.  It would be reasonable to combine them at 
some point.  However, for this PR it would be great to just fix the bug at hand.

If we are going to do major refactoring I'd want to see benchmarks showing 
that we aren't introducing any performance regressions.

It would also be nice to see a test case that currently fails but passes 
once this PR is added.





[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49265512
  
QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16766/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15067812
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala ---
@@ -67,95 +61,12 @@ case class HiveTableScan(
   }
 
   @transient
-  private[this] val hadoopReader = new HadoopTableReader(relation.tableDesc, context)
-
-  /**
-   * The hive object inspector for this table, which can be used to extract values from the
-   * serialized row representation.
-   */
-  @transient
-  private[this] lazy val objectInspector =
-    relation.tableDesc.getDeserializer.getObjectInspector.asInstanceOf[StructObjectInspector]
-
-  /**
-   * Functions that extract the requested attributes from the hive output.  Partitioned values are
-   * casted from string to its declared data type.
-   */
-  @transient
-  protected lazy val attributeFunctions: Seq[(Any, Array[String]) => Any] = {
-    attributes.map { a =>
-      val ordinal = relation.partitionKeys.indexOf(a)
-      if (ordinal >= 0) {
-        val dataType = relation.partitionKeys(ordinal).dataType
-        (_: Any, partitionKeys: Array[String]) => {
-          castFromString(partitionKeys(ordinal), dataType)
-        }
-      } else {
-        val ref = objectInspector.getAllStructFieldRefs
-          .find(_.getFieldName == a.name)
-          .getOrElse(sys.error(s"Can't find attribute $a"))
-        val fieldObjectInspector = ref.getFieldObjectInspector
-
-        val unwrapHiveData = fieldObjectInspector match {
-          case _: HiveVarcharObjectInspector =>
-            (value: Any) => value.asInstanceOf[HiveVarchar].getValue
-          case _: HiveDecimalObjectInspector =>
-            (value: Any) => BigDecimal(value.asInstanceOf[HiveDecimal].bigDecimalValue())
-          case _ =>
-            identity[Any] _
-        }
-
-        (row: Any, _: Array[String]) => {
-          val data = objectInspector.getStructFieldData(row, ref)
-          val hiveData = unwrapData(data, fieldObjectInspector)
-          if (hiveData != null) unwrapHiveData(hiveData) else null
-        }
-      }
-    }
-  }
+  private[this] val hadoopReader = new HadoopTableReader(attributes, relation, context)
 
   private[this] def castFromString(value: String, dataType: DataType) = {
     Cast(Literal(value), dataType).eval(null)
   }
 
-  private def addColumnMetadataToConf(hiveConf: HiveConf) {
--- End diff --

I would keep it. It is important to set the needed columns in the conf so that 
RCFile and ORC can know which columns should be skipped. Also, it seems 
`hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypeNames)` and 
`hiveConf.set(serdeConstants.LIST_COLUMNS, columnInternalNames)` will be used 
to push down filters. A sketch of what this method is responsible for follows.
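
For illustration, a minimal sketch of such a method, assuming the `ColumnProjectionUtils` helpers from older Hive releases; the parameter names (`neededIds`, `columnNames`, `columnTypeNames`) are illustrative stand-ins for values derived from the scan's output attributes, not Spark's exact code:

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.serde.serdeConstants
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

def addColumnMetadataToConf(
    hiveConf: HiveConf,
    neededIds: Seq[Int],
    columnNames: Seq[String],
    columnTypeNames: String): Unit = {
  // Record which column ids/names the scan actually needs, so columnar
  // formats (RCFile, ORC) can skip deserializing the rest.
  ColumnProjectionUtils.appendReadColumnIDs(hiveConf, neededIds.map(Int.box).asJava)
  ColumnProjectionUtils.appendReadColumnNames(hiveConf, columnNames.asJava)
  // The same two properties quoted above, which filter pushdown is
  // expected to rely on as well.
  hiveConf.set(serdeConstants.LIST_COLUMNS, columnNames.mkString(","))
  hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypeNames)
}
```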




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15068484
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -156,33 +158,43 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
   }
 
   // Create local references so that the outer object isn't serialized.
-  val tableDesc = _tableDesc
+  val tableDesc = relation.tableDesc
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
+  val mutableRow = new GenericMutableRow(attributes.length)
+
+  // split the attributes (output schema) into 2 categories:
+  // (partition keys, ordinal), (normal attributes, ordinal), the ordinal mean the
+  // position of the in the output Row.
--- End diff --

position of the attribute?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49338675
  
I think we are not clear on the boundary between a `TableReader` and a 
physical `TableScan` operator (e.g. `HiveTableScan`). It seems we just want 
`TableReader` to create `RDD`s (general-purpose work) and, inside a `TableScan` 
operator, to create Catalyst `Row`s (table-specific work). However, when we 
look at `HadoopTableReader`, it is actually a `HiveTableReader`: for every Hive 
partition, we create a `HadoopRDD` (requiring Hive-specific code) and 
deserialize Hive rows. I am not sure `TableReader` is a good abstraction. 

I think it makes sense to remove the `TableReader` trait and add an 
abstract `TableScan` class (inheriting `LeafNode`), with all existing table scan 
operators inheriting from it; a rough sketch follows. If we think this is the 
right approach, I can do it in another PR.
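
A hypothetical sketch of that hierarchy; the names and the `execute` contract follow Spark SQL's physical operators of the time, but this is an illustration of the proposal, not committed code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.{Attribute, Row}
import org.apache.spark.sql.execution.LeafNode

// Proposed shared base: the operator-level contract lives in an abstract
// TableScan, and each format-specific scan only supplies its rows.
abstract class TableScan extends LeafNode {
  def attributes: Seq[Attribute]
  override def output: Seq[Attribute] = attributes

  // Table-specific work: produce the RDD of Catalyst Rows for this scan.
  protected def buildRows(): RDD[Row]

  override def execute(): RDD[Row] = buildRows()
}
```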

@marmbrus, @liancheng, @rxin, @chenghao-intel thoughts?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49341477
  
@chenghao-intel explained the root cause in 
https://issues.apache.org/jira/browse/SPARK-2523. Basically, we should use 
partition-specific `ObjectInspector`s to extract fields instead of the 
`ObjectInspector` set in the `TableDesc`: if the partitions of a table use 
different SerDes, their `ObjectInspector`s will be different (see the sketch 
below). @chenghao-intel can you add unit tests? Is there any Hive query test 
that can be included?
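
In code, the fix boils down to something like the following hedged sketch, using Hive's metadata API (`partitionInspector` is an illustrative name, not Spark's method):

```scala
import org.apache.hadoop.hive.ql.metadata.Partition
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector

// Buggy pattern: one table-level inspector reused for every partition, e.g.
//   tableDesc.getDeserializer.getObjectInspector
// Fix: ask each partition's own deserializer, since partitions may have
// been written with different SerDes.
def partitionInspector(part: Partition): StructObjectInspector =
  part.getDeserializer.getObjectInspector.asInstanceOf[StructObjectInspector]
```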




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15092652
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala ---
@@ -67,95 +61,12 @@ case class HiveTableScan(
   }
 
   @transient
-  private[this] val hadoopReader = new HadoopTableReader(relation.tableDesc, context)
-
-  /**
-   * The hive object inspector for this table, which can be used to extract values from the
-   * serialized row representation.
-   */
-  @transient
-  private[this] lazy val objectInspector =
-    relation.tableDesc.getDeserializer.getObjectInspector.asInstanceOf[StructObjectInspector]
-
-  /**
-   * Functions that extract the requested attributes from the hive output.  Partitioned values are
-   * casted from string to its declared data type.
-   */
-  @transient
-  protected lazy val attributeFunctions: Seq[(Any, Array[String]) => Any] = {
-    attributes.map { a =>
-      val ordinal = relation.partitionKeys.indexOf(a)
-      if (ordinal >= 0) {
-        val dataType = relation.partitionKeys(ordinal).dataType
-        (_: Any, partitionKeys: Array[String]) => {
-          castFromString(partitionKeys(ordinal), dataType)
-        }
-      } else {
-        val ref = objectInspector.getAllStructFieldRefs
-          .find(_.getFieldName == a.name)
-          .getOrElse(sys.error(s"Can't find attribute $a"))
-        val fieldObjectInspector = ref.getFieldObjectInspector
-
-        val unwrapHiveData = fieldObjectInspector match {
-          case _: HiveVarcharObjectInspector =>
-            (value: Any) => value.asInstanceOf[HiveVarchar].getValue
-          case _: HiveDecimalObjectInspector =>
-            (value: Any) => BigDecimal(value.asInstanceOf[HiveDecimal].bigDecimalValue())
-          case _ =>
-            identity[Any] _
-        }
-
-        (row: Any, _: Array[String]) => {
-          val data = objectInspector.getStructFieldData(row, ref)
-          val hiveData = unwrapData(data, fieldObjectInspector)
-          if (hiveData != null) unwrapHiveData(hiveData) else null
-        }
-      }
-    }
-  }
+  private[this] val hadoopReader = new HadoopTableReader(attributes, relation, context)
 
   private[this] def castFromString(value: String, dataType: DataType) = {
     Cast(Literal(value), dataType).eval(null)
   }
 
-  private def addColumnMetadataToConf(hiveConf: HiveConf) {
--- End diff --

Oh, yes, I didn't realize that. I will revert it.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15092708
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -156,33 +158,43 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
   }
 
   // Create local references so that the outer object isn't serialized.
-  val tableDesc = _tableDesc
+  val tableDesc = relation.tableDesc
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
+  val mutableRow = new GenericMutableRow(attributes.length)
+
+  // split the attributes (output schema) into 2 categories:
+  // (partition keys, ordinal), (normal attributes, ordinal), the ordinal mean the
+  // position of the in the output Row.
--- End diff --

I should document this better; actually `Row` is a sub-interface of `Seq`.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49386103
  
@yhuai I agree with you that we should make a clear boundary between 
`HiveTableScan` and `TableReader`, but I am not sure it's a good idea to 
create multiple `HiveTableScan` classes instead of one. Routing to different 
table scan operators may require exposing more details to the `SparkPlanner` 
that currently sit inside `HiveTableScan`. 
Perhaps creating multiple `TableReader`s is more reasonable, for example 
`TableReader`, `PartitionReader`, `MemoryTableReader`, etc.; see the sketch below.
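
A rough sketch of that alternative, with illustrative names (`PartitionReader` is hypothetical here; the `makeRDDFor*` method names mirror the shape of the existing `TableReader` trait):

```scala
import org.apache.hadoop.hive.ql.metadata.{Partition => HivePartition, Table => HiveTable}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.Row

// Illustrative only: specialize readers per kind of input, instead of
// specializing the scan operator per kind of input.
trait TableReader {
  def makeRDDForTable(hiveTable: HiveTable): RDD[Row]
}

trait PartitionReader extends TableReader {
  def makeRDDForPartitionedTable(partitions: Seq[HivePartition]): RDD[Row]
}
```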




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49386783
  
@chenghao-intel I did not mean to introduce multiple `HiveTableScan`s. I 
meant to have an abstract `TableScan` and make the existing ones (e.g. 
`HiveTableScan` and `ParquetTableScan`) subclasses of the abstract 
`TableScan`.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49387608
  
@yhuai sorry if I misunderstood. Do you mean that `HiveTableScan` and 
`ParquetTableScan` are the new operators created by the SparkPlanner?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-17 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49388234
  
@yhuai I see what you mean now; I think you're right, some of the logic 
could be shared among `TableScan` operators.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49138992
  
`ObjectInspector` is not required by `Row` in Catalyst any more (unlike 
in Shark), and it is tightly coupled with the `Deserializer` and the raw data, 
so I moved the `ObjectInspector` into `TableReader`.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49147325
  
QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16724/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49175346
  
Could you elaborate on when we will see an exception?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15013611
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -241,4 +252,37 @@ private[hive] object HadoopTableReader {
 val bufferSize = System.getProperty("spark.buffer.size", "65536")
 jobConf.set("io.file.buffer.size", bufferSize)
   }
+
+  /**
+   * Transform the raw data (Writable object) into the Row object for an iterable input
+   * @param iter Iterable input which represented as Writable object
+   * @param deserializer Deserializer associated with the input writable object
+   * @param attrs Represents the row attribute names and its zero-based position in the MutableRow
+   * @param row reusable MutableRow object
+   *
+   * @return Iterable Row object that transformed from the given iterable input.
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer,
+      attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] = {
+    val soi = deserializer.getObjectInspector().asInstanceOf[StructObjectInspector]
+    // get the field references according to the attributes (output of the reader) required
+    val fieldRefs = attrs.map { case (attr, idx) => (soi.getStructFieldRef(attr.name), idx) }
+
+    // Map each tuple to a row object
+    iter.map { value =>
+      val raw = deserializer.deserialize(value)
+      var idx = 0;
+      while(idx < fieldRefs.length) {
--- End diff --

nit: space after while




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49258632
  
@yhuai @concretevitamin thanks for the comments; I've updated the 
description in JIRA, can you please jump there and take a look?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49258678
  
Sorry, forgot to paste the link. 
https://issues.apache.org/jira/browse/SPARK-2523




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49259426
  
QA tests have started for PR 1439. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16766/consoleFull

