[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181727263
  
LGTM, merging this into master, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181667202
  
**[Test build #50951 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50951/consoleFull)**
 for PR 11055 at commit 
[`e79fc85`](https://github.com/apache/spark/commit/e79fc85d1d5c4e7aead4d648c264766c775fec40).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181695332
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181695338
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50951/
Test PASSed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181660379
  
**[Test build #50948 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50948/consoleFull)**
 for PR 11055 at commit 
[`028bc5d`](https://github.com/apache/spark/commit/028bc5d83b2c7d526722be61322d8b805fafb154).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181660464
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50948/
Test FAILed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181660462
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/11055





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181694918
  
**[Test build #50951 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50951/consoleFull)**
 for PR 11055 at commit 
[`e79fc85`](https://github.com/apache/spark/commit/e79fc85d1d5c4e7aead4d648c264766c775fec40).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181565008
  
I tried this patch with the ss_max query; it failed with:
```
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.readPage(UnsafeRowParquetRecordReader.java:850)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.readBatch(UnsafeRowParquetRecordReader.java:618)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.access$000(UnsafeRowParquetRecordReader.java:461)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.nextBatch(UnsafeRowParquetRecordReader.java:224)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.nextKeyValue(UnsafeRowParquetRecordReader.java:174)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:202)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:187)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:357)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:94)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:85)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
```





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread nongli
Github user nongli commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181639460
  
@davies Thanks for testing it out. I fixed the bug (reading multiple row groups).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-181642792
  
**[Test build #50948 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50948/consoleFull)**
 for PR 11055 at commit 
[`028bc5d`](https://github.com/apache/spark/commit/028bc5d83b2c7d526722be61322d8b805fafb154).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-05 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r52081720
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -345,6 +345,14 @@ private[spark] object SQLConf {
 defaultValue = Some(true),
 doc = "Enables using the custom ParquetUnsafeRowRecordReader.")
 
+  // Note: this can not be enabled all the time because the reader will 
not be returning UnsafeRows.
+  // Doing so is very expensive and we should remove this requirement 
instead of fixing it here.
+  // Initial testing seems to indicate only sort requires this.
--- End diff --

These make sense; we could leave them as future work.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-05 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r52084570
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala
 ---
@@ -65,7 +65,7 @@ private[parquet] class CatalystSchemaConverter(
   def this(conf: Configuration) = this(
 assumeBinaryIsString = 
conf.get(SQLConf.PARQUET_BINARY_AS_STRING.key).toBoolean,
 assumeInt96IsTimestamp = 
conf.get(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key).toBoolean,
-writeLegacyParquetFormat = 
conf.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key).toBoolean)
+writeLegacyParquetFormat = 
conf.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, "false").toBoolean)
--- End diff --

The default value could change, so we should use 
`SQLConf.PARQUET_WRITE_LEGACY_FORMAT.default.get.toString`

Could you also fix the above two?
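
A minimal sketch of the suggestion above, with a hypothetical `ConfEntry` class standing in for a SQLConf entry (the real entry is Scala and its API is not reproduced here; the key string is made up). The point is only that the fallback comes from the entry's declared default instead of a literal repeated at the call site:

```java
import java.util.Map;

// Hypothetical stand-in for a config entry: a key plus its declared default.
final class ConfEntry {
    final String key;
    final boolean defaultValue;

    ConfEntry(String key, boolean defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }
}

final class ConfLookup {
    // Illustrative entry only; the key name is an assumption, not Spark's.
    static final ConfEntry WRITE_LEGACY_FORMAT =
        new ConfEntry("example.parquet.writeLegacyFormat", false);

    // Falls back to the entry's own default, so changing the default in one
    // place changes it for every reader of the entry.
    static boolean read(Map<String, String> conf, ConfEntry entry) {
        String raw = conf.getOrDefault(entry.key, String.valueOf(entry.defaultValue));
        return Boolean.parseBoolean(raw);
    }
}
```

With this shape, a call site never spells out `"false"` itself, which is the drift the comment warns about.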





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-180047876
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50761/
Test PASSed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-180047872
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51952950
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -345,6 +345,14 @@ private[spark] object SQLConf {
 defaultValue = Some(true),
 doc = "Enables using the custom ParquetUnsafeRowRecordReader.")
 
+  // Note: this can not be enabled all the time because the reader will 
not be returning UnsafeRows.
+  // Doing so is very expensive and we should remove this requirement 
instead of fixing it here.
+  // Initial testing seems to indicate only sort requires this.
--- End diff --

Right, all the operators output UnsafeRow, and some operators may depend on 
some properties of UnsafeRow: 

1). copy() returns UnsafeRow

2). getStruct() returns UnsafeRow, getArray() returns UnsafeArrayData, 
getMap() returns UnsafeMapData

3). hashCode() is murmur3 on bytes of UnsafeRow

4). compareTo() will compare the row as bytes

For example, the in-memory cache requires 2), and Except requires 4).

@rxin Do you have any more comments on this?
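
To make properties 3) and 4) concrete, here is a rough sketch with hypothetical row classes (not Spark's). It uses `Arrays.hashCode` in place of murmur3 and only handles long fields. It shows that a byte-backed row gets equality and hashing from its bytes, while a row that is merely a view over column arrays returns the same values but does not satisfy those properties:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical minimal row interface for the sketch.
interface SketchRow {
    long getLong(int ordinal);
}

// Row stored as one contiguous byte array; equality and hash use the bytes.
final class BytesBackedRow implements SketchRow {
    private final byte[] bytes;

    BytesBackedRow(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * values.length);
        for (long v : values) {
            buf.putLong(v);
        }
        this.bytes = buf.array();
    }

    @Override
    public long getLong(int ordinal) {
        return ByteBuffer.wrap(bytes).getLong(Long.BYTES * ordinal);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BytesBackedRow && Arrays.equals(bytes, ((BytesBackedRow) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes); // stand-in for a murmur3 hash of the bytes
    }
}

// Row that is only a view into column arrays; it keeps identity equality.
final class ColumnViewRow implements SketchRow {
    private final long[][] columns;
    private final int rowId;

    ColumnViewRow(long[][] columns, int rowId) {
        this.columns = columns;
        this.rowId = rowId;
    }

    @Override
    public long getLong(int ordinal) {
        return columns[ordinal][rowId];
    }
}
```

An operator that deduplicates rows through a hash set would work with the first class and silently misbehave with the second, which is the kind of hidden dependency listed above.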





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-180047567
  
**[Test build #50761 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50761/consoleFull)**
 for PR 11055 at commit 
[`197f9b5`](https://github.com/apache/spark/commit/197f9b5c73d89a41f6da6c08dd1c6908be563bc5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51950606
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java
 ---
@@ -100,6 +101,9 @@ private static void appendValue(ColumnVector dst, 
DataType t, Object o) {
 dst.appendStruct(false);
 dst.getChildColumn(0).appendInt(c.months);
 dst.getChildColumn(1).appendLong(c.microseconds);
+  } else if (t instanceof DateType) {
+Date date = (Date)o;
+dst.appendInt((int)date.getTime());
--- End diff --

Is this right? check `DateTimeUtils.fromJavaDate`
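
For context, a sketch of why the cast looks suspicious, assuming DateType values are stored as an int count of days since 1970-01-01. `getTime()` returns milliseconds, so casting it to int stores a truncated millisecond count, not a day count. The helper name below is hypothetical, and this is not the body of `DateTimeUtils.fromJavaDate`; it only shows a day-based conversion using the JDK:

```java
import java.sql.Date;

final class DateDaysSketch {
    // Converts a java.sql.Date to days since 1970-01-01 in the local calendar.
    // Hypothetical helper for illustration; Spark's own conversion lives in
    // DateTimeUtils.fromJavaDate and may differ in its time zone handling.
    static int toEpochDays(Date date) {
        return (int) date.toLocalDate().toEpochDay();
    }
}
```

For example, `DateDaysSketch.toEpochDays(Date.valueOf("1970-01-02"))` yields 1, whereas `(int) date.getTime()` for the same date yields a millisecond value that depends on the local time zone.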





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51949138
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java
 ---
@@ -18,19 +18,22 @@
 
 import java.math.BigDecimal;
 import java.math.BigInteger;
+import java.sql.Date;
 import java.util.Arrays;
 import java.util.Iterator;
 
 import org.apache.spark.memory.MemoryMode;
 import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericMutableRow;
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
 import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.DateTimeUtils;
--- End diff --

DateTimeUtils/Date are not used anywhere.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51959711
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -345,6 +345,14 @@ private[spark] object SQLConf {
 defaultValue = Some(true),
 doc = "Enables using the custom ParquetUnsafeRowRecordReader.")
 
+  // Note: this can not be enabled all the time because the reader will 
not be returning UnsafeRows.
+  // Doing so is very expensive and we should remove this requirement 
instead of fixing it here.
+  // Initial testing seems to indicate only sort requires this.
--- End diff --

I think we can consider a few things.

1. Only turn this on when it is part of the whole stage codegen pipeline 
which shouldn't have any of these requirements.
2. Clean up InternalRow. It's not helpful to try to use InternalRow as a 
superclass if it needs a specific implementation in many places. I don't think 
we want to just have UnsafeRow since its requirements are too high (and 
therefore slow).
3. Relax the requirements so that they are enforced by the operator, not 
the row. I think for example, we should remove copy(). The places that 
currently need copy should use something like a row serializer that copies to a 
contiguous byte buffer or whatever the operator wants. I'm not convinced a 
general purpose copy is necessary internally.
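
As a rough illustration of point 3, here is what an operator-side serializer could look like in place of a general `copy()`. All names are hypothetical and the sketch only handles long fields; it is not a proposal for the actual API:

```java
import java.nio.ByteBuffer;

// Hypothetical minimal row interface for the sketch.
interface LongFieldRow {
    int numFields();
    long getLong(int ordinal);
}

// A row that is a view over a shared array the reader may overwrite later.
final class ArrayViewRow implements LongFieldRow {
    private final long[] shared;

    ArrayViewRow(long[] shared) {
        this.shared = shared;
    }

    @Override
    public int numFields() {
        return shared.length;
    }

    @Override
    public long getLong(int ordinal) {
        return shared[ordinal];
    }
}

final class RowSerializerSketch {
    // Copies the row's fields into one contiguous byte array. The operator
    // that needs a stable snapshot calls this; the row itself needs no copy().
    static byte[] serialize(LongFieldRow row) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * row.numFields());
        for (int i = 0; i < row.numFields(); i++) {
            buf.putLong(row.getLong(i));
        }
        return buf.array();
    }
}
```

The snapshot stays valid after the shared array is reused, and only the operators that buffer rows pay for the copy.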





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51922204
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java
 ---
@@ -645,7 +676,15 @@ private void decodeDictionaryIds(int rowId, int num, 
ColumnVector column) {
 }
   } else if (column.dataType() == DataTypes.ByteType) {
 for (int i = rowId; i < rowId + num; ++i) {
-  column.putByte(i, 
(byte)dictionary.decodeToInt(dictionaryIds.getInt(i)));
+  column.putByte(i, (byte) 
dictionary.decodeToInt(dictionaryIds.getInt(i)));
+}
+  } else if (column.dataType() == DataTypes.ShortType) {
+for (int i = rowId; i < rowId + num; ++i) {
+  column.putShort(i, (short) 
dictionary.decodeToInt(dictionaryIds.getInt(i)));
+}
+  } else if (DecimalType.is64BitDecimalType(column.dataType())) {
--- End diff --

This code is like a nested loop. On the outer, we loop over the parquet 
type and on the inner, the column vector type.

In the case you mentioned, the parquet type is int64 and the column vector 
type is long so I think this just works. Let me make that more explicit by 
adding a check in the INT64 branch.
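
A small sketch of that two-level dispatch, with hypothetical enums and a plain `long[]` standing in for the column vector (none of these names come from the patch). The outer switch is on the Parquet physical type, the inner one on the column vector type, and the INT64 branch carries the explicit check mentioned above. Real code would hoist the dispatch out of the per-row loop:

```java
final class NestedDispatchSketch {
    enum ParquetType { INT32, INT64 }
    enum VectorType { BYTE, SHORT, INT, LONG }

    // Decodes dictionary ids into values, narrowing as the vector type requires.
    static long[] decode(ParquetType physical, VectorType vector,
                         long[] dictionary, int[] ids) {
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            long value = dictionary[ids[i]];
            switch (physical) {            // outer: Parquet physical type
                case INT32:
                    switch (vector) {      // inner: column vector type
                        case BYTE:
                            out[i] = (byte) value;
                            break;
                        case SHORT:
                            out[i] = (short) value;
                            break;
                        case INT:
                            out[i] = (int) value;
                            break;
                        default:
                            throw new UnsupportedOperationException(
                                "INT32 cannot fill a " + vector + " vector");
                    }
                    break;
                case INT64:
                    // Explicit check: only a long-backed vector is accepted here.
                    if (vector != VectorType.LONG) {
                        throw new UnsupportedOperationException(
                            "INT64 cannot fill a " + vector + " vector");
                    }
                    out[i] = value;
                    break;
                default:
                    throw new UnsupportedOperationException(
                        "Unsupported Parquet type: " + physical);
            }
        }
        return out;
    }
}
```

A 64-bit decimal stored as INT64 would take the INT64 branch with a long-backed vector, so it needs no separate case in this sketch.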






[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51922394
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java
 ---
@@ -716,6 +788,28 @@ private void readLongBatch(int rowId, int num, 
ColumnVector column) throws IOExc
   }
 }
 
+private void readFloatBatch(int rowId, int num, ColumnVector column) 
throws IOException {
+  // This is where we implement support for the valid type conversions.
+  // TODO: implement remaining type conversions
--- End diff --

I'll change the TODO. I think if the parquet type is float and the spark 
type is double, we should probably allow that.
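
A sketch of that conversion, using plain arrays in place of the Parquet reader and the column vector (the method name is hypothetical). Java's float-to-double widening is exact, so each double holds the same value the float held:

```java
final class FloatBatchSketch {
    // Copies float values into a double array, as a double-typed column
    // would receive them. Widening float to double loses no information.
    static double[] readFloatsAsDoubles(float[] src) {
        double[] dst = new double[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }
}
```

Note that the widened value of `0.1f` is not equal to the double literal `0.1`; the conversion preserves the float's value, it does not add precision.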





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-180016936
  
**[Test build #50761 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50761/consoleFull)**
 for PR 11055 at commit 
[`197f9b5`](https://github.com/apache/spark/commit/197f9b5c73d89a41f6da6c08dd1c6908be563bc5).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51909478
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java
 ---
@@ -645,7 +676,15 @@ private void decodeDictionaryIds(int rowId, int num, 
ColumnVector column) {
 }
   } else if (column.dataType() == DataTypes.ByteType) {
 for (int i = rowId; i < rowId + num; ++i) {
-  column.putByte(i, 
(byte)dictionary.decodeToInt(dictionaryIds.getInt(i)));
+  column.putByte(i, (byte) 
dictionary.decodeToInt(dictionaryIds.getInt(i)));
+}
+  } else if (column.dataType() == DataTypes.ShortType) {
+for (int i = rowId; i < rowId + num; ++i) {
+  column.putShort(i, (short) 
dictionary.decodeToInt(dictionaryIds.getInt(i)));
+}
+  } else if (DecimalType.is64BitDecimalType(column.dataType())) {
--- End diff --

Decimal could also be INT64, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51909811
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java
 ---
@@ -716,6 +788,28 @@ private void readLongBatch(int rowId, int num, 
ColumnVector column) throws IOExc
   }
 }
 
+private void readFloatBatch(int rowId, int num, ColumnVector column) 
throws IOException {
+  // This is where we implement support for the valid type conversions.
+  // TODO: implement remaining type conversions
--- End diff --

I think the dataType could only be FloatType


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51909743
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java
 ---
@@ -716,6 +788,28 @@ private void readLongBatch(int rowId, int num, 
ColumnVector column) throws IOExc
   }
 }
 
+private void readFloatBatch(int rowId, int num, ColumnVector column) 
throws IOException {
+  // This is where we implement support for the valid type conversions.
+  // TODO: implement remaining type conversions
--- End diff --

I think the dataType could only be FloatType





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11055#discussion_r51909239
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java
 ---
@@ -166,15 +170,23 @@ public void close() throws IOException {
   @Override
   public boolean nextKeyValue() throws IOException, InterruptedException {
 if (batchIdx >= numBatched) {
-  if (!loadBatch()) return false;
+  if (columnarBatch != null) {
--- End diff --

vectorizedDecode() ?
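
One way to read this suggestion is to hide the `columnarBatch != null` check behind a named predicate so that `nextKeyValue()` dispatches on intent rather than on a null test. The sketch below is a toy under stated assumptions: `BatchedReaderSketch`, `nextBatch()`, `fill()`, and the `int[]` buffers are hypothetical, and both paths load data the same way here, whereas in the real reader they would decode differently.

```java
// Toy reader over an int array; only the dispatch shape mirrors the diff.
final class BatchedReaderSketch {
  private final int[] source;
  private final int batchSize;
  private final int[] columnarBatch; // null when the row-at-a-time path is used
  private final int[] rowBuffer;
  private int consumed = 0;   // values already pulled from source
  private int batchIdx = 0;   // next position within the current batch
  private int numBatched = 0; // number of values in the current batch

  BatchedReaderSketch(int[] source, int batchSize, boolean vectorized) {
    this.source = source;
    this.batchSize = batchSize;
    this.columnarBatch = vectorized ? new int[batchSize] : null;
    this.rowBuffer = vectorized ? null : new int[batchSize];
  }

  // Named predicate in place of an inline null check.
  boolean vectorizedDecode() {
    return columnarBatch != null;
  }

  boolean nextKeyValue() {
    if (batchIdx >= numBatched) {
      boolean loaded = vectorizedDecode() ? nextBatch() : loadBatch();
      if (!loaded) return false;
    }
    ++batchIdx;
    return true;
  }

  int getCurrentValue() {
    int[] buffer = vectorizedDecode() ? columnarBatch : rowBuffer;
    return buffer[batchIdx - 1];
  }

  private boolean nextBatch() { return fill(columnarBatch); }

  private boolean loadBatch() { return fill(rowBuffer); }

  // Copies up to batchSize values into dest; returns false once source is exhausted.
  private boolean fill(int[] dest) {
    int n = Math.min(batchSize, source.length - consumed);
    if (n <= 0) return false;
    System.arraycopy(source, consumed, dest, 0, n);
    consumed += n;
    numBatched = n;
    batchIdx = 0;
    return true;
  }
}
```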





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179456926
  
**[Test build #50670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50670/consoleFull)** for PR 11055 at commit [`e601bbd`](https://github.com/apache/spark/commit/e601bbdc84e744f123ed840bcabc59f52433390a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBase `





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179457193
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179457198
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50670/
Test FAILed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179525711
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179525707
  
**[Test build #50696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50696/consoleFull)** for PR 11055 at commit [`d307dbd`](https://github.com/apache/spark/commit/d307dbd60f4c5517b1ff299d93a16f8fb0122e6c).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBase `





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179556622
  
**[Test build #50700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50700/consoleFull)** for PR 11055 at commit [`a6d00a9`](https://github.com/apache/spark/commit/a6d00a9edbed79aa64b33cbd04489cabe8911f4a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179531536
  
**[Test build #50700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50700/consoleFull)** for PR 11055 at commit [`a6d00a9`](https://github.com/apache/spark/commit/a6d00a9edbed79aa64b33cbd04489cabe8911f4a).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179524080
  
**[Test build #50696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50696/consoleFull)** for PR 11055 at commit [`d307dbd`](https://github.com/apache/spark/commit/d307dbd60f4c5517b1ff299d93a16f8fb0122e6c).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179556823
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50700/
Test PASSed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179556820
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179525717
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50696/
Test FAILed.





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179669580
  
cc @davies for review





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11055#issuecomment-179421846
  
**[Test build #50670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50670/consoleFull)** for PR 11055 at commit [`e601bbd`](https://github.com/apache/spark/commit/e601bbdc84e744f123ed840bcabc59f52433390a).





[GitHub] spark pull request: [SPARK-12992][SQL] Support vectorized decoding...

2016-02-03 Thread nongli
GitHub user nongli opened a pull request:

https://github.com/apache/spark/pull/11055

[SPARK-12992][SQL] Support vectorized decoding in UnsafeRowParquetRecordReader.

WIP: running tests. The code needs a bit of cleanup.

This patch completes the vectorized decoding with the goal of passing the
existing tests. More patches are still needed to support the rest of the
format spec, even just for flat schemas.

This patch adds a new flag to enable the vectorized decoding. Tests were
updated to run in both modes where applicable.

Once this is working well, we can remove the previous code path.
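
As an illustration of running tests in both modes, a helper can toggle the flag, run the same test body once per setting, and restore the previous value afterwards. This is a sketch under stated assumptions, not the test code from the patch: `BothModesSketch`, `withBothModes`, the flag key `example.parquet.vectorizedDecoding`, and the plain `Map` used as configuration are all hypothetical.

```java
import java.util.Map;
import java.util.function.Consumer;

final class BothModesSketch {
  // Hypothetical key; the actual flag name is not given in this message.
  static final String VECTORIZED_FLAG = "example.parquet.vectorizedDecoding";

  // Runs `body` once with the flag off and once with it on, then restores the old value.
  static void withBothModes(Map<String, String> conf, Consumer<Boolean> body) {
    String previous = conf.get(VECTORIZED_FLAG);
    try {
      for (boolean vectorized : new boolean[] {false, true}) {
        conf.put(VECTORIZED_FLAG, Boolean.toString(vectorized));
        body.accept(vectorized);
      }
    } finally {
      if (previous == null) {
        conf.remove(VECTORIZED_FLAG);
      } else {
        conf.put(VECTORIZED_FLAG, previous);
      }
    }
  }
}
```

Keeping both paths under the same assertions makes it cheaper to delete the old path later, since any behavioral difference shows up as a test failure in only one mode.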

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nongli/spark spark-12992-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11055.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11055


commit e601bbdc84e744f123ed840bcabc59f52433390a
Author: Nong Li 
Date:   2016-02-03T06:59:47Z

[SPARK-12992][SQL] Support vectorized decoding in UnsafeRowParquetRecordReader.

WIP: running tests. The code needs a bit of cleanup.

This patch completes the vectorized decoding with the goal of passing the
existing tests. More patches are still needed to support the rest of the
format spec, even just for flat schemas.

This patch adds a new flag to enable the vectorized decoding. Tests were
updated to run in both modes where applicable.

Once this is working well, we can remove the previous code path.



