[Catalyst] Code Generation and the Constant Pool Limit

2017-05-12 Thread Aleksander Eskilson
Hi all, I want to take a moment to highlight an issue and, hopefully, invite some developers to review a pull request [1] for SPARK-18016 [2]. Code generated by Catalyst currently places all split

Re: Uploading PySpark 2.1.1 to PyPi

2017-05-12 Thread Sameer Agarwal
Holden, Thanks again for pushing this forward! Out of curiosity, did we get an approval from the PyPi folks? Regards, Sameer On Mon, May 8, 2017 at 11:44 PM, Holden Karau wrote: > So I have a PR to add this to the release process documentation - I'm > waiting on the

Run an OS command or script supplied by the user at the start of each executor

2017-05-12 Thread Luca Canali
Hi, I have recently experimented with a few ways to run OS commands from the executors (in a YARN deployment) for a specific use case where we want to interact with an external system of interest for our environment. From that experience I thought that having the possibility to spawn a script

Re: Faster Spark on ORC with Apache ORC

2017-05-12 Thread Dong Joon Hyun
Hi, I have been wondering how much more Apache Spark 2.2.0 will be improved. This is the prior record from the source code. Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz SQL Single Int Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative

Re: RandomForest caching

2017-05-12 Thread madhu phatak
Hi, I opened a JIRA: https://issues.apache.org/jira/browse/SPARK-20723 Can someone have a look? On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak wrote: > Hi, > > I am testing RandomForestClassification with 50 GB of data which is cached > in memory. I have 64 GB of RAM, in