[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-30 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 @ueshin Thanks for the fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-29 Thread ueshin
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16541 I sent a pr #17473.

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread brkyvz
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/16541 This PR unfortunately broke Scala 2.10 compilation https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/4110/console

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 thanks, merging to master!

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Merged build finished. Test PASSed.

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75263/ Test PASSed.

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16541 **[Test build #75263 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75263/testReport)** for PR 16541 at commit

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16541 **[Test build #75263 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75263/testReport)** for PR 16541 at commit

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 retest this please

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Merged build finished. Test FAILed.

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75255/ Test FAILed.

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 LGTM

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16541 **[Test build #75255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75255/testReport)** for PR 16541 at commit

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-27 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 ok to test

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-26 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Thanks. Made the suggested changes in my latest commit. I also encountered a minor problem when doing final testing. When using a collection type that is a type alias (e.g.,

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-24 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 LGTM except 2 minor comments

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-15 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 That seems to be the case here, yes. What about the other benefits I mentioned (adding support for Java `List`s and future Scala 2.13 compatibility)? I think the codegen is also more

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-10 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16541 I didn't look into the details here, but very often scanning data twice doesn't necessarily slow things down, especially in the case of sequential scan.

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-10 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Well, technically yes. But I would say it's a little more than that. The current approach to deserialization of `Seq`s is to copy the data into an array, construct a `WrappedArray`
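
A minimal sketch of the two strategies contrasted in the comment above, written in plain Scala with no Spark dependency. The comment is cut off after "construct a `WrappedArray`", so the final conversion step in `viaWrappedArray` is an assumption, as is the builder-based alternative (inferred only from the `dataset-seq-builder` branch name elsewhere in this thread). The names `SeqDeserializationSketch`, `viaWrappedArray`, and `viaBuilder` are hypothetical and do not come from the PR.

```scala
import scala.collection.mutable

object SeqDeserializationSketch {

  // Array-first route: copy the elements into a fresh array, wrap the array
  // as a Seq, then convert the wrapper to the target collection (assumed step).
  def viaWrappedArray(data: Array[Int]): List[Int] = {
    val copy = new Array[Int](data.length)
    System.arraycopy(data, 0, copy, 0, data.length)
    val wrapped = mutable.WrappedArray.make[Int](copy)
    wrapped.toList // second pass over the data
  }

  // Builder route: append each element straight into a builder for the
  // target collection, so no intermediate array or wrapper is allocated.
  def viaBuilder(data: Array[Int]): List[Int] = {
    val builder = List.newBuilder[Int]
    builder.sizeHint(data.length)
    var i = 0
    while (i < data.length) {
      builder += data(i)
      i += 1
    }
    builder.result()
  }
}
```

Both functions return the same `List`; they differ only in how many intermediate objects are created and how many passes are made over the elements, which is the point under discussion in the surrounding comments.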

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-08 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 is it a performance improvement? there is no difference in the benchmark results

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-04 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Also please note the [UnsafeArrayData-producing branch](https://github.com/michalsenkyr/spark/compare/dataset-seq-builder...michalsenkyr:dataset-seq-builder-unsafe) that is not yet merged into

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-03-04 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Would it be possible for somebody to review this PR for me? I have a few ideas that are dependent on this and I'd like to get to work on them. Most notably support for Java Lists. Maybe

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-02-02 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Apologies for taking so long. I tried modifying the serialization logic as best as I could to serialize into `UnsafeArrayData` ([branch

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-18 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 I added the benchmarks based on the code you provided but I am getting almost the same results before and after the optimization (see description). So either the added benefit is really small

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-16 Thread kiszk
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/16541 Can we get additional performance improvement if we could generate `UnsafeArrayData` instead of `GenericArrayData` for this statement ```/* 104 */ final ArrayData serializefromobject_value =

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-16 Thread kiszk
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/16541 This looks like an optimization similar to https://github.com/apache/spark/pull/15044. Does [this code](https://github.com/apache/spark/pull/15044/files#diff-d6f03c9d3e82f3774d1110559b039a6d)

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-15 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Added benchmarks. I didn't find any standardized way of benchmarking codegen so I wrote a simple script for Spark Shell. Benchmarks were run on a laptop so the collections couldn't be

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-11 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Added codegen comparison for a simple `List` dataset. I will also prepare a benchmark and add some results later. Those will be for `List`, `mutable.Queue` and `Seq`. Where `List` and

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-11 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16541 Is this a perf optimization? If yes, can you show some benchmarks? Also for codegen it's good to show the generated code before/after this change. You can get that with ```

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-10 Thread michalsenkyr
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Also, the new `CollectObjects` copies quite a bit of code from `MapObjects`. Should I move the code into a common trait in order to reduce duplication or should I leave it as is?

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Can one of the admins verify this patch?