RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
Hi Liquan, thanks for the response. In your example, I think the hash table should be built on the "right" side, so Spark can iterate through the left side and efficiently find matches in the right side via the hash table. Please comment and suggest, thanks again!

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
Hi Haopu, My understanding is that the hash tables on both the left and right sides are used to include null values in the result in an efficient manner. If a hash table is only built on one side, let's say the left side, and we perform a left outer join, then for each row in the left side, a scan over the right side is n
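To make the trade-off concrete, here is a minimal one-sided sketch of a hash-based full outer join (in Python for brevity; this is an illustration of the idea in the thread, not Spark's actual HashOuterJoin code). It builds a hash table on the right side only and tracks which right keys were matched, so unmatched rows on either side can still be emitted with nulls:

```python
from collections import defaultdict

def hash_full_outer_join(left, right):
    # Build a hash table on the right side only.
    table = defaultdict(list)
    for k, v in right:
        table[k].append(v)
    matched = set()  # right-side keys that found a left match
    out = []
    for k, v in left:  # stream the left side, probing the hash table
        if k in table:
            matched.add(k)
            out.extend((k, v, rv) for rv in table[k])
        else:
            out.append((k, v, None))  # left row with no right match
    for k, vs in table.items():  # emit right rows that never matched
        if k not in matched:
            out.extend((k, None, rv) for rv in vs)
    return out
```

Under this scheme the streamed side never needs its own hash table; the cost is the extra `matched` bookkeeping and a second pass over the built side.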

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big. And it doesn't reduce the iteration over the streamed relation, right? Thanks!

Re: Hyper Parameter Optimization Algorithms

2014-09-29 Thread Debasish Das
You should look into Evan Sparks' talk from Spark Summit 2014 http://spark-summit.org/2014/talk/model-search-at-scale I am not sure if some of it is already open sourced through MLBase...

Re: jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread Nan Zhu
Just noticed these lines in the jenkins log = Running Apache RAT checks = Attempting to fetch rat Launching rat from /home/jenkins/workspace/SparkPul

Hyper Parameter Optimization Algorithms

2014-09-29 Thread Lochana Menikarachchi
Hi, Is there anyone who works on hyper parameter optimization algorithms? If not, is there any interest in the subject? We are thinking about implementing some of these algorithms and contributing them to Spark. Thoughts? Lochana
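For context, one of the simplest hyper-parameter optimization baselines mentioned in this space is random search. The sketch below (Python, illustrative only; the names and toy objective are hypothetical) shows the basic loop such a contribution would generalize:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    # Minimal random-search sketch: sample each hyper-parameter uniformly
    # from its (lo, hi) range and keep the best-scoring configuration.
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective with its minimum at lr=0.1, reg=1.0 (stand-in for a
# cross-validated model-training run).
obj = lambda p: (p["lr"] - 0.1) ** 2 + (p["reg"] - 1.0) ** 2
best, score = random_search(obj, {"lr": (0.0, 1.0), "reg": (0.0, 2.0)})
```

More sophisticated methods (grid search, Bayesian optimization) replace the sampling step but keep the same evaluate-and-keep-best structure.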

Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread shane knapp
yeah, this is why i'm gonna keep a close eye on things this week... as for VMs vs containers, please do the latter more than the former. one of our longer-term plans here at the lab is to move most of our jenkins infra to VMs, and running tests w/nested VMs is Bad[tm].

Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread Reynold Xin
Thanks. We might see more failures due to contention on resources. Fingers crossed ... At some point it might make sense to run the tests in a VM or container.

FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread shane knapp
we were running at 8 executors per node, and BARELY even stressing the machines (32 cores, ~230G RAM). in the interest of actually using system resources, and giving ourselves some headroom, i upped the executors to 16 per node. i'll be keeping an eye on ganglia for the rest of the week to make s

jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread shane knapp
happy monday, everyone! remember a few weeks back when i upgraded jenkins, and unwittingly began DOSing our system due to massive log spam? well, that bug has been fixed w/the current release and i'd like to get our logging levels back to something more verbose than we have now. downtime will be

BasicOperationsSuite failing ?

2014-09-29 Thread Ted Yu
Hi, Running the test suite in trunk, I got: BasicOperationsSuite: - map - flatMap - filter - glom - mapPartitions - repartition (more partitions) - repartition (fewer partitions) - groupByKey - red

Re: How to use multi thread in RDD map function ?

2014-09-29 Thread Yi Tian
Hi myasuka, Have you checked the JVM GC time of each executor? I think you should increase SPARK_EXECUTOR_CORES or SPARK_EXECUTOR_INSTANCES until you get enough concurrency. Here is my recommended config: SPARK_EXECUTOR_CORES=8 SPARK_EXECUTOR_INSTANCES=4 SPARK_WORKER_MEMORY=8G note: ma

Re: How to use multi thread in RDD map function ?

2014-09-29 Thread myasuka
Our cluster is a standalone cluster with 16 computing nodes, and each node has 16 cores. I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to 32; with 512 tasks all together, this setup helps increase the concurrency. But if I set SPARK_WORKER_INSTANCES to 2, SPARK_WORKER_CORES t
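The knob being tuned in this thread is the total number of task slots per node, which in standalone mode is the product of the worker instance count and the cores per worker. A spark-env.sh sketch (values mirror the thread's second configuration and are only an example):

```shell
# Hypothetical spark-env.sh fragment for standalone mode.
export SPARK_WORKER_INSTANCES=2   # worker daemons per node
export SPARK_WORKER_CORES=16      # cores offered by each worker
export SPARK_WORKER_MEMORY=8G     # memory offered by each worker

# Total concurrent task slots a node advertises:
TOTAL_SLOTS=$((SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES))
echo "task slots per node: $TOTAL_SLOTS"
```

Both 1x32 and 2x16 advertise 32 slots per node; the difference is how much heap and GC pressure each worker JVM carries, which is why the GC-time question upthread matters.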

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Yanbo Liang
Thank you for all your patient responses. I can conclude that if the data is totally separable or over-fitting occurs, the weights may differ. This is also consistent with my experiment. I have evaluated two different datasets and the results are as follows: Loss function: LogisticGradient Regularizer: L2

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread DB Tsai
Can you check the loss of both the LBFGS and SGD implementations? One reason may be that SGD doesn't converge well, and you can see that by comparing both log-likelihoods. Another potential reason may be that the labels of your training data are totally separable, so you can always increase the log-likelihood by mul

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Xiangrui Meng
The test accuracy doesn't mean the total loss. All points between (-1, 1) can separate points -1 and +1 and give you 1.0 accuracy, but their corresponding losses are different. -Xiangrui On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang wrote: > Hi > > We have used LogisticRegression with two different
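The separability point made in this thread can be demonstrated numerically: on a linearly separable toy dataset, scaling the weights up leaves accuracy at 1.0 but keeps lowering the logistic loss, so the unregularized optimum runs off to infinity. A small sketch (Python; the dataset and function names are illustrative, not MLlib code):

```python
import math

def logistic_loss(w, data):
    # data: list of (x, y) pairs with y in {-1, +1}; model score is w * x.
    # Per-point log-loss: log(1 + exp(-y * w * x)).
    return sum(math.log1p(math.exp(-y * w * x)) for x, y in data)

# Toy separable 1-D dataset: negative x labeled -1, positive x labeled +1.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

# Every positive w classifies all four points correctly (accuracy 1.0),
# yet the loss keeps shrinking as w grows.
for w in (1.0, 10.0, 100.0):
    print(w, logistic_loss(w, data))
```

This is why SGD and LBFGS can report the same test accuracy with very different weight vectors: they simply stop at different points along a loss surface with no finite minimizer (absent regularization).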