[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 Notable that this is the same technology that @hadley and I used to create Feather last year (https://github.com/wesm/feather). We'll be continuing to pursue collaborations in this area between the R a

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 ## Future Work The support of all Spark data types to include complex types will be a quick follow up to ensure full compatibility. With the added ability to convert a Dataset on the e

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 ## Benchmarks with Conversion done on Executor (updated) ## 1mm Longs _ | With Arrow | Without Arrow --||--- count | 50.00 |50.00

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 ## Dependency Info This change does add Apache Arrow as a dependency, specifically the Java arrow-vector artifact. For Python, usage is optional and tests are conditional on ability to

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 The Python Arrow tests should have been skipped since they are conditional on pyarrow being installed and it's not in `requirements.txt`. I'll hold off on testing for now until we can make sure
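For readers unfamiliar with the pattern mentioned above (tests that are skipped unless pyarrow is installed), here is a minimal sketch using only the standard `unittest` module. It is not the code from the PR: the class name `ArrowRoundTripTests`, the helper `run_suite`, and the test body are hypothetical and only illustrate the skip mechanism.

```python
import importlib.util
import unittest

# True only when the optional pyarrow package can be imported.
HAVE_PYARROW = importlib.util.find_spec("pyarrow") is not None


# Hypothetical test class: the whole class is skipped when pyarrow is missing.
@unittest.skipIf(not HAVE_PYARROW, "pyarrow is not installed")
class ArrowRoundTripTests(unittest.TestCase):
    def test_round_trip(self):
        # Imported inside the test so the module still loads without pyarrow.
        import pyarrow as pa

        arr = pa.array([1, 2, 3])
        self.assertEqual(arr.to_pylist(), [1, 2, 3])


def run_suite():
    """Run the hypothetical suite and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArrowRoundTripTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

With this shape, a CI machine without pyarrow reports the test as skipped rather than failed, which is the behaviour the comment above expects.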

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler not sure if it's relevant, but though the conda-forge artifacts are not up to date, they should still be wire compatible with the Arrow 0.2 JAR --- If your project is set up for it,

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73366/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73366/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73366/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9c

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 jenkins retest this please

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73355/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73355/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73355/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9c

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Jenkins retest please

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73321/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73321/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73321/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9c

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 jenkins retest this please

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73310/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73310 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73310/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73310 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73310/testReport)** for PR 15821 at commit [`9c8ea63`](https://github.com/apache/spark/commit/9c

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73308 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73308/testReport)** for PR 15821 at commit [`42af1d5`](https://github.com/apache/spark/commit/4

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73308/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73308 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73308/testReport)** for PR 15821 at commit [`42af1d5`](https://github.com/apache/spark/commit/42

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73305/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73305/testReport)** for PR 15821 at commit [`54884ed`](https://github.com/apache/spark/commit/5

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #73305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73305/testReport)** for PR 15821 at commit [`54884ed`](https://github.com/apache/spark/commit/54

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-22 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Great news @wesm! I am just cleaning up the tests and will post an update using the new release soon. I'm happy to help with any maintenance too, just let me know what you need.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-21 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 The 0.2 Maven artifacts have been posted. I'll try to update the conda-forge packages this week -- if anyone can help with conda-forge maintenance that would be a big help. Thanks!

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-05 Thread mariusvniekerk
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15821 Probably a good thing to look at is the R pieces since that is effectively constrained to InternalRow

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-05 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 @zero323 that is exactly the plan. It's a bit complicated though because the Python UDF code path handles arbitrary iterators, not just `Array[InternalRow]`

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-02-05 Thread zero323
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/15821 This looks amazing and I can't help but wonder - if the next step is _generating the arrow batches on executors_ is it possible to reuse this to pass data between JVM and Python UDFs? Right now, wit

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72180/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #72180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72180/testReport)** for PR 15821 at commit [`b35192c`](https://github.com/apache/spark/commit/b3

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-30 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 The conda-forge pyarrow package is now up to date with latest Arrow git master

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-26 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Thanks for the review @wesm! Those are good ideas, I'll work on an update to this. As for the packaging conflicts, I had to add exclusions for `com.fasterxml.jackson.core:jackson-annotations`,

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-25 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 > Parallelizing the record batch conversion and streaming it to Python would be another significant perf win. Right, I should have also mentioned that this PR takes a simplistic approa

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-25 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 Very nice to see the improved wall clock times. I have been busy engineering the pipeline between the byte stream from Spark and the resulting DataFrame -- the only major thing still left on the table

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread leifwalsh
Github user leifwalsh commented on the issue: https://github.com/apache/spark/pull/15821 The next iteration of this for perf would likely involve generating the arrow batches on executors and having the driver use the new streaming arrow format to just forward this to python. In our e

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15821 On a personal note, those benchmarks certainly look very exciting (<3 max of with arrow less than min of without arrow) :) It certainly seems it would probably be worth the review bandwidth

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Here are some rough benchmarks done locally on a machine with 16GB mem and 8 cores, using Spark config defaults and taken from 50 trials of calling `toPandas()` with and without Arrow enabled:
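A benchmark of the kind described above (repeated wall-clock timings of one call, then summary statistics) could be sketched roughly as follows. This is not the `benchmark.py` from the thread: the `benchmark` helper and the stand-in workload are hypothetical, and no Spark session is created here. With a real Spark DataFrame `df`, one would presumably pass `lambda: df.toPandas()` instead.

```python
import statistics
import time


def benchmark(fn, trials=50):
    """Time `fn` over `trials` runs and return summary statistics in seconds."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "count": len(timings),
        "mean": statistics.mean(timings),
        "std": statistics.stdev(timings),
        "min": min(timings),
        "max": max(timings),
    }


# Stand-in workload (assumption); replace with `lambda: df.toPandas()`
# when a Spark DataFrame `df` is available.
summary = benchmark(lambda: sum(range(1000)), trials=50)
```

Running the same helper once with Arrow enabled and once without gives two summaries whose min and max can be compared directly.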

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 This has been updated after integrating changes made with @icexelloss and @wesm. There has been good progress made and it would be great if others could take a look and review/test this out.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71950/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #71950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71950/testReport)** for PR 15821 at commit [`9bb75de`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #71950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71950/testReport)** for PR 15821 at commit [`9bb75de`](https://github.com/apache/spark/commit/9b

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-23 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 That sounds good, I'm out of town atm but will update this tomorrow to get some more eyes on it. On Jan 23, 2017 11:40 AM, "Li Jin" wrote: > Bryan, > > I am wor

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-23 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 Bryan, I am working on: (1) Add more numbers to benchmark.py (2) Add support for date/timestamp/binary type (3) Fix memory leaking in the code. All these should be don

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-18 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 >Shall we update this PR to the latest and solicit involvement from Spark committers? Yeah, I think it's about ready for that. After we integrate the latest changes, I'll go over

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-18 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 Shall we update this PR to the latest and solicit involvement from Spark committers?

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-02 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 OK, let's open pull requests into that branch to help with not stepping on each other's toes. Thank you

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-01 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 @icexelloss, @wesm I branched off here for us to integrate our changes https://github.com/BryanCutler/spark/tree/arrow-integration cc @yinxusen

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-01 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler, I have been working based on your branch here: https://github.com/BryanCutler/spark/tree/wip-toPandas_with_arrow-SPARK-13534 Is this the right one?

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-01 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Hi @wesm and @icexelloss , that sounds good on our end. @yinxusen has been working on validating some basic conversion so far, but everything is still very preliminary so it would be great to w

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-01 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 Related to this we'll also want to be able to precisely instrument and benchmark the Dataset <-> Arrow conversion -- @icexelloss suggested we might be able to push down the conversion into the executors i

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-01 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler I'm working with @icexelloss on my end to get involved in this, we were going to start working on unit tests to validate converting each of the Spark SQL data types to Arrow format while t

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-30 Thread wesm
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 Luckily we are on the home stretch for making the Java and C++ libraries binary compatible -- e.g. I'm working on automated testing today: https://github.com/apache/arrow/pull/219 --- If your project

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-30 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Thanks @mariusvniekerk, as @holdenk said we are going to try to get something basic working first and after we show some performance improvement, we can follow up with more things

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-29 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15821 @mariusvniekerk I think just getting this working for local connection is going to be hard so breaking up using arrow on the driver side into a separate follow up piece of work would make sense.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-29 Thread mariusvniekerk
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15821 So this is very cool stuff. Would it be reasonable to add some api pieces so that on the python side things like DataFrame.mapPartitions makes use of Apache Arrow to lower the seri

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-22 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Hey @holdenk, I just had this in to do my own testing and hadn't thought about keeping the option, but if we do keep it then yeah you're right, it would be better to default to the origina

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68954 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68954/consoleFull)** for PR 15821 at commit [`9191b96`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68954/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68954 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68954/consoleFull)** for PR 15821 at commit [`9191b96`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68812/consoleFull)** for PR 15821 at commit [`9191b96`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68812/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68812/consoleFull)** for PR 15821 at commit [`9191b96`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68806/consoleFull)** for PR 15821 at commit [`053e3a6`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68806/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68806/consoleFull)** for PR 15821 at commit [`053e3a6`](https://github.com/apache/spark/commit/0

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68427/consoleFull)** for PR 15821 at commit [`b06e11f`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68427/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68427/consoleFull)** for PR 15821 at commit [`b06e11f`](https://github.com/apache/spark/commit/b

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68425/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68425/consoleFull)** for PR 15821 at commit [`3f855ec`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68425/consoleFull)** for PR 15821 at commit [`3f855ec`](https://github.com/apache/spark/commit/3

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68381/ Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-08 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68381 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68381/consoleFull)** for PR 15821 at commit [`4227ec6`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821 Merged build finished. Test FAILed.

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-08 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 The test currently fails with the stack trace ``` Traceback (most recent call last): File "/home/bryan/git/spark/python/pyspark/sql/tests.py", line 2000, in test_arrow_round_trip

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-11-08 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821 **[Test build #68381 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68381/consoleFull)** for PR 15821 at commit [`4227ec6`](https://github.com/apache/spark/commit/4