Re: Using sampleByKey

2014-11-18 Thread Sean Owen
I use randomSplit to make a train/CV/test set in one go. It definitely produces disjoint data sets and is efficient. The problem is you can't do it by key. I am not sure why your subtract does not work. I suspect it is because the values do not partition the same way, or they don't evaluate
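A minimal sketch of why a weighted random split yields disjoint subsets: each record is assigned to exactly one bucket. This is plain Python for illustration only, not Spark's `randomSplit`; the function name `random_split` and its parameters are hypothetical.

```python
import random
from itertools import accumulate


def random_split(records, weights, seed=0):
    """Assign every record to exactly one split, chosen in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = list(accumulate(w / total for w in weights))
    splits = [[] for _ in weights]
    for rec in records:
        x = rng.random()
        for i, bound in enumerate(bounds):
            if x < bound:
                splits[i].append(rec)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(rec)
    return splits


train, cv, test = random_split(list(range(1000)), [0.6, 0.2, 0.2], seed=42)
```

Because the draw is made per record and ignores any key, nothing here guarantees that a given user ends up in every split, which is the limitation noted in the message above.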

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-18 Thread Ashutosh
Hi Anant, I have removed the counter and all possible side effects. Now I think we can go ahead with the testing. I have created another folder for testing. I will add you as a collaborator on GitHub. _Ashutosh From: slcclimber [via Apache Spark Developers

Re: Quantile regression in tree models

2014-11-18 Thread Alessandro Baretta
Manish, My use case for (asymmetric) absolute error is quite trivially quantile regression. In other words, I want to use Spark to learn conditional cumulative distribution functions. See R's GBM quantile regression option. If you either find or create a Jira ticket, I would be happy to give it
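For reference, the asymmetric absolute error used in quantile regression is commonly written as the "pinball" loss. The sketch below is plain Python under that assumption; the function names and the quantile parameter `tau` are illustrative and do not come from MLlib or from R's GBM.

```python
def pinball_loss(y_true, y_pred, tau):
    """Asymmetric absolute error for quantile level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must be in (0, 1)")
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs tau per unit of error,
    # over-prediction (diff < 0) costs (1 - tau) per unit of error.
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def pinball_gradient(y_true, y_pred, tau):
    """A subgradient of the loss with respect to y_pred."""
    # The loss is not differentiable at y_true == y_pred; 1 - tau is one
    # valid subgradient there.
    return -tau if y_true > y_pred else 1.0 - tau
```

With `tau = 0.5` the loss reduces to half the absolute error, so the minimizer is the conditional median; other values of `tau` target other conditional quantiles.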

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
`sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should not be many strata. My question is why we need to split on each user's ratings. If a user is missing in
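To illustrate the distinction, here is a plain-Python sketch, not Spark code. It assumes the approximate variant makes an independent per-record draw, while the exact variant takes ceil(fraction * count) records from each stratum. Both function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def bernoulli_sample_by_key(pairs, fractions, seed=0):
    """Keep each (key, value) independently with probability fractions[key]."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


def exact_sample_by_key(pairs, fractions, seed=0):
    """Keep exactly ceil(fractions[key] * count) values for every key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    sample = []
    for k, values in groups.items():
        size = math.ceil(fractions[k] * len(values))
        # Sampling without replacement inside the stratum.
        sample.extend((k, v) for v in rng.sample(values, size))
    return sample


ratings = [("u1", i) for i in range(10)] + [("u2", i) for i in range(4)]
exact = exact_sample_by_key(ratings, {"u1": 0.5, "u2": 0.5}, seed=1)
```

With one shared fraction, the per-record version makes the same decision for every record regardless of its key, which is why it behaves like an ordinary sample. Only the exact version fixes the count per key, at the cost of grouping by stratum first.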

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
For mllib PR, I will add this logic: If a user is missing in training and appears in test, we can simply ignore it. I was struggling since users appear in test that the model was not trained on... For our internal tests we want to cross validate on every product / user as all of them are
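The "ignore users missing from training" rule can be sketched in plain Python as a filter on the test set. The `(user, product, rating)` tuple layout and the function name are assumptions made for illustration, not the layout used in the MLlib PR.

```python
def drop_unseen_users(train, test):
    """Keep only the test ratings whose user also appears in the training set."""
    # Each rating is assumed to be a (user, product, rating) tuple.
    seen_users = {user for user, _product, _rating in train}
    return [r for r in test if r[0] in seen_users]


train = [("alice", "p1", 4.0), ("bob", "p2", 3.0)]
test = [("alice", "p3", 5.0), ("carol", "p1", 2.0)]
evaluable = drop_unseen_users(train, test)
```

The same idea would apply to products if the model also cannot score items it has never seen.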

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
If all users are equally important, then the average score should be representative. You shouldn't worry about missing one or two. For stratified sampling, wikipedia has a paragraph about its disadvantage: http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages It depends on the size of

Re: Implementing TinkerPop on top of GraphX

2014-11-18 Thread Kyle Ellrott
The new Tinkerpop3 API was different enough from V2 that it was worth starting a new implementation rather than trying to completely refactor my old code. I've started a new project: https://github.com/kellrott/spark-gremlin which compiles and runs the first set of unit tests (which it completely

Re: Quantile regression in tree models

2014-11-18 Thread Manish Amde
Hi Alex, Here is the ticket for refining tree predictions. Let's discuss this further on the JIRA. https://issues.apache.org/jira/browse/SPARK-4240 There is no ticket yet for quantile regression. It will be great if you could create one and note down the corresponding loss function and gradient

Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Yiming (John) Zhang
Hi, I noticed it is hard to find a thorough introduction to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt, which is not straightforward for beginners. So I spent several days to figure it out and hope that it would be helpful for beginners like me and that professionals can help me

Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Chen He
Thank you Yiming. It is helpful. Regards! Chen

Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Chester @work
For sbt, you can simply run sbt/sbt gen-idea to generate the IntelliJ IDEA project module for you. You can then just open the generated project, which includes all the needed dependencies.

Re: Apache infra github sync down

2014-11-18 Thread Reynold Xin
This basically stops us from merging patches. I'm wondering if it is possible for ASF to give some Spark committers write permission to github repo. In that case, if the sync tool is down, we can manually push periodically. On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell pwend...@gmail.com

re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Yiming (John) Zhang
Hi Chester, thank you for your reply. But I tried this approach and it failed. It seems that using sbt in IntelliJ is more difficult than expected. And according to some references # sbt/sbt gen-idea is not necessary (after Spark-1.0.0?); you can simply import the spark project and