Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
-dev +user How are you measuring network traffic? It's not true in general that there will be zero network traffic, since not all executors are local to all data. That can be the case in many situations, but not always. On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li wrote: > Hi,

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-26 Thread Patrick Wendell
I verified that the issue with build binaries being present in the source release is fixed. Haven't done enough vetting for a full vote, but did verify that. On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache

Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Hi, I find that loading files from HDFS can incur a huge amount of network traffic. The input size is 90G and the network traffic is about 80G. As I understand it, local files should be read, so no network communication should be needed. I use Spark 1.5.1, and the following is my code: val textRDD =

RE: spark-sql / apache-drill / jboss-tiied

2015-10-26 Thread prajod.vettiyattil
Hi, Though not the comparison you wanted, I have implemented a SparkSQL vs Hive performance comparison with one master and two worker instances. Data was stored in HDFS. SparkSQL showed promise. I used Spark version 1.4 and Hadoop version 2.6. https://hivevssparksql.wordpress.com/ The table

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical features? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai wrote: > Interesting. For feature sub-sampling, is it

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do you think you can implement a generic GBM and have it merged as part of the Spark codebase? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon,

Re: Spark Implementation of XGBoost

2015-10-26 Thread YiZhi Liu
There's an xgboost exploration jira SPARK-8547. Can it be a good start? 2015-10-27 7:07 GMT+08:00 DB Tsai : > Also, does it support categorical feature? > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-26 Thread Krishna Sankar
Guys, sc.version returns 1.5.1 in both Python and Scala. Is anyone getting the same result? Probably I am doing something wrong. Cheers On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark > version

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi YiZhi, Thank you for mentioning the jira. I will add a note to the jira. Meihua On Mon, Oct 26, 2015 at 6:16 PM, YiZhi Liu wrote: > There's an xgboost exploration jira SPARK-8547. Can it be a good start? > > 2015-10-27 7:07 GMT+08:00 DB Tsai : >>

Re: Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-26 Thread 周千昊
I have replaced the default Java serialization with Kryo. It does reduce the shuffle size, and performance has improved; however, the shuffle speed remains unchanged. I am quite new to Spark. Does anyone have an idea of which direction I should take to find the root cause? 周千昊

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi DB Tsai, Thank you very much for your interest and comments. 1) Feature sub-sampling is per-node, as in random forest. 2) The current code heavily exploits the tree structure to speed up learning (such as processing multiple learning nodes in one pass over the training data). So a generic GBM

Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi Spark User/Dev, Inspired by the success of XGBoost, I have created a Spark package for gradient boosted trees with a second-order approximation of arbitrary user-defined loss functions. https://github.com/rotationsymmetry/SparkXGBoost Currently linear (normal) regression, binary classification,
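
For readers unfamiliar with the second-order approximation mentioned in the announcement above, the following is a sketch of the Taylor expansion commonly used in XGBoost-style boosting. It is not taken from the SparkXGBoost code. The symbols are notation assumed here for illustration: \ell is the loss, F_{t-1} the ensemble after t-1 trees, and f_t the tree being added.

```latex
% Second-order Taylor expansion of the loss around the current prediction
% (illustrative notation, not taken from the SparkXGBoost source).
\[
\ell\bigl(y_i,\; F_{t-1}(x_i) + f_t(x_i)\bigr)
\;\approx\;
\ell\bigl(y_i,\; F_{t-1}(x_i)\bigr)
+ g_i\, f_t(x_i)
+ \tfrac{1}{2}\, h_i\, f_t(x_i)^2 ,
\]
\[
g_i = \left.\frac{\partial \ell(y_i, F)}{\partial F}\right|_{F = F_{t-1}(x_i)},
\qquad
h_i = \left.\frac{\partial^2 \ell(y_i, F)}{\partial F^2}\right|_{F = F_{t-1}(x_i)} .
\]
```

Under this approximation, a twice-differentiable loss enters the tree-fitting step only through g_i and h_i, which is presumably what makes user-defined losses possible in such a package.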