Help needed to publish SizeEstimator as separate library

2014-11-19 Thread madhu phatak
Hi, As I was going through the Spark source code, SizeEstimator https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala caught my eye. It's a very useful tool for estimating object sizes on the JVM, which helps in use cases like memory-bounded caches. It
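
To illustrate the memory-bounded cache use case mentioned above, here is a minimal sketch in Python. It is not Spark code and does not use SizeEstimator: the class name `SizeBoundedCache` and its methods are hypothetical, and `sys.getsizeof` is only a shallow stand-in for a real estimator that would walk the whole object graph.

```python
import sys
from collections import OrderedDict


class SizeBoundedCache:
    """LRU cache that evicts entries once an estimated byte budget is exceeded.

    Hypothetical illustration: `estimate` stands in for a real size estimator.
    """

    def __init__(self, max_bytes, estimate=sys.getsizeof):
        self.max_bytes = max_bytes
        self.estimate = estimate
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, estimated size)

    def put(self, key, value):
        size = self.estimate(value)
        # Drop any previous entry for this key before accounting for the new one.
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        if size > self.max_bytes:
            return False  # a single value over budget is never cached
        self._entries[key] = (value, size)
        self.used_bytes += size
        # Evict least recently used entries until back under budget.
        while self.used_bytes > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        return True

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]
```

The estimator is passed in as a parameter so that a more accurate one can be swapped in without touching the eviction logic.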

Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-19 Thread Chester At Work
gen-idea should work. I use it all the time. But use the approach that works for you Sent from my iPad On Nov 18, 2014, at 11:12 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Hi Chester, thank you for your reply. But I tried this approach and it failed. It seems that there are more

Build break

2014-11-19 Thread Patrick Wendell
Hey All, Just a heads up. I merged this patch last night which caused the Spark build to break: https://github.com/apache/spark/commit/397d3aae5bde96b01b4968dde048b6898bb6c914 The patch itself was fine and previously had passed on Jenkins. The issue was that other intermediate changes merged

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-19 Thread Andrew Or
I will start with a +1 2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com: Please vote on releasing the following candidate as Apache Spark version 1.1.1. This release fixes a number of bugs in Spark 1.1.0. Some of the notable ones are - [SPARK-3426] Sort-based shuffle compression

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-19 Thread Xiangrui Meng
+1. Checked version numbers and doc. Tested a few ML examples with Java 6 and verified some recently merged bug fixes. -Xiangrui On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote: I will start with a +1 2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com: Please

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-19 Thread slcclimber
You could also use rdd.zipWithIndex() to create indexes. Anant -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9441.html Sent from the Apache Spark Developers List mailing list archive at
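
As a rough model of what rdd.zipWithIndex() produces, the sketch below pairs each element with a global index in plain Python, ordering first by partition and then by position within the partition. It is not Spark code: the function name `zip_with_index` and the list-of-lists stand-in for partitions are assumptions made for illustration.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair each element with a global index, partition by partition.

    `partitions` is a list of lists standing in for the partitions of an RDD
    (a hypothetical representation, not a Spark type).
    """
    sizes = [len(p) for p in partitions]
    # Starting offset of each partition = total size of all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]
```

The offsets depend on the sizes of all earlier partitions, so in this model the partition sizes have to be known before any index can be assigned.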

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-19 Thread Krishna Sankar
+1 1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package 10:49 min 2. Tested pyspark, MLlib 2.1. statistics OK 2.2. Linear/Ridge/Lasso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK 2.5. rdd operations OK 2.6. recommendation OK

Too many open files error

2014-11-19 Thread Qiuzhuang Lian
Hi All, While doing some ETL, I ran into a 'Too many open files' error, as shown in the following logs. Thanks, Qiuzhuang 14/11/20 20:12:02 INFO collection.ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 100.8 KB to disk (953 times so far) 14/11/20 20:12:02 ERROR storage.DiskBlockObjectWriter:

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-19 Thread Ashutosh
Done. Thanks. Added you as a collaborator so that you can add code to it. Thanks, Ashutosh From: slcclimber [via Apache Spark Developers List] ml-node+s1001551n9441...@n3.nabble.com Sent: Thursday, November 20, 2014 7:49 AM To: Ashutosh Trivedi (MT2013030)

Re: Too many open files error

2014-11-19 Thread Dinesh J. Weerakkody
Hi Qiuzhuang, This is a Linux-related issue. Please go through [1] and change the limits. Hope this will solve your problem. [1] https://rtcamp.com/tutorials/linux/increase-open-files-limit/ On Thu, Nov 20, 2014 at 9:45 AM, Qiuzhuang Lian qiuzhuang.l...@gmail.com wrote: Hi All, While
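
For reference, the open-files limit discussed here can be inspected, and its soft value raised, from inside a process. The sketch below uses Python's standard `resource` module (Unix only). The helper name `raise_open_file_limit` is hypothetical, the change applies only to the current process and its children, and it does not reproduce the system-wide steps in the linked tutorial.

```python
import resource


def raise_open_file_limit(target):
    """Raise this process's soft open-file limit toward `target`.

    Hypothetical helper: never lowers the limit and never exceeds the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may only raise the soft limit up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `resource.getrlimit(resource.RLIMIT_NOFILE)` on its own shows the current soft and hard limits, which is a useful first check before changing anything.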

Re: Too many open files error

2014-11-19 Thread Sandy Ryza
Qiuzhuang, This is a known issue: ExternalAppendOnlyMap can do tons of tiny spills in certain situations. SPARK-4452 aims to deal with this issue, but we haven't finalized a solution yet. Dinesh's solution should help as a workaround, but you'll likely experience suboptimal performance when