Re: [discuss] DataFrame function namespacing

2015-05-04 Thread Reynold Xin
After talking with people on this thread and offline, I've decided to go with option 1, i.e. putting everything in a single "functions" object. On Thu, Apr 30, 2015 at 10:04 AM, Ted Yu wrote: > IMHO I would go with choice #1 > > Cheers > > On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin wrote: >

Re: Speeding up Spark build during development

2015-05-04 Thread Tathagata Das
In addition to Michael's suggestion, in my SBT workflow I also use "~" to automatically kick off builds and unit tests. For example, sbt/sbt "~streaming/test-only *BasicOperationsSuite*" It will automatically detect any file changes in the project and start the compilation and testing. So my full w

OOM error with GMMs on 4GB dataset

2015-05-04 Thread Vinay Muttineni
Hi, I am training a GMM with 10 Gaussians on a 4 GB dataset (720,000 × 760). The Spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also has 6GB. Spark Config Params: .set("spark.hadoop.validateOutputSpecs", "false").set("spark.dynamicAllocation.enabled", "false").set("spark.
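
As a rough back-of-the-envelope check on the numbers above (assuming dense double-precision features, which may not match how the data is actually stored in the JVM), one full copy of the raw matrix is already about 4 GiB, close to a single 6 GB executor's heap:

```python
# Rough memory estimate for the dataset above, assuming dense 8-byte doubles
# (the real storage format and JVM object overhead may differ substantially).
rows, cols, k = 720_000, 760, 10

raw_bytes = rows * cols * 8            # one dense copy of the data
resp_bytes = rows * k * 8              # per-point responsibilities for k Gaussians

print(f"raw data: {raw_bytes / 2**30:.2f} GiB")          # ~4.08 GiB
print(f"responsibilities: {resp_bytes / 2**30:.3f} GiB")
```

This is only an order-of-magnitude sketch; EM for GMMs also keeps intermediate sufficient statistics per partition, so actual peak usage is higher.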

Thanking Test Partners

2015-05-04 Thread Patrick Wendell
Hey All, Community testing during the QA window is an important part of the release cycle in Spark. It helps us deliver higher quality releases by vetting out issues not covered by our unit tests. I was thinking that from now on, it would be nice to recognize the organizations that donate time to

Re: Speeding up Spark build during development

2015-05-04 Thread Michael Armbrust
FWIW... My Spark SQL development workflow is usually to run "build/sbt sparkShell" or "build/sbt 'sql/test-only '". These commands start in as little as 30s on my laptop, automatically figure out which subprojects need to be rebuilt, and don't require the expensive assembly creation. On Mon, May

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
@joe, I'd be glad to help if you need. On Mon, May 4, 2015 at 8:06 PM, Matei Zaharia wrote: > I don't know whether this is common, but we might also allow another > separator for JSON objects, such as two blank lines. > > Matei > > > On May 4, 2015, at 2:28 PM, Reynold Xin wrote: > > > > Joe - I

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei > On May 4, 2015, at 2:28 PM, Reynold Xin wrote: > > Joe - I think that's a legit and useful thing to do. Do you want to give it > a shot? > > On Mon, May 4, 2015 at
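
A minimal plain-Python sketch of the separator idea above (two blank lines between pretty-printed objects; an illustration of the proposal only, not Spark's actual JSON input format):

```python
import json

def split_json_records(text, separator="\n\n\n"):
    # Two blank lines between pretty-printed objects means three consecutive
    # newlines in the raw text; split there and parse each chunk separately.
    return [json.loads(chunk) for chunk in text.split(separator) if chunk.strip()]

stream = '{\n  "a": 1\n}\n\n\n{\n  "a": 2,\n  "b": [3, 4]\n}\n'
records = split_json_records(stream)
print(records)  # [{'a': 1}, {'a': 2, 'b': [3, 4]}]
```

The appeal of a fixed separator is that an input split can be scanned for record boundaries without parsing any JSON at all.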

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
Joe - I think that's a legit and useful thing to do. Do you want to give it a shot? On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell wrote: > I think Reynold’s argument shows the impossibility of the general case. > > But a “maximum object depth” hint could enable a new input format to do > its jo

Task scheduling times

2015-05-04 Thread Akshat Aranya
Hi, I have been investigating scheduling delays in Spark and I found some unexplained anomalies. In my use case, I have two stages after collapsing the transformations: the first is a mapPartitions() and the second is a sortByKey(). I found that the task serialization for the first stage takes m
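
The message is truncated, so the actual cause is unknown, but one common reason for slow task serialization is a large closure captured by the stage's function. A plain-Python sketch of the effect (conceptually similar to, but not the same as, Spark's closure serialization; names here are hypothetical):

```python
import pickle
from functools import partial

def lookup(table, x):
    return table.get(x, "?")

# A task function with almost no captured state vs. one that drags a large
# table along with it: the serialized form of the second is far bigger,
# and that serialization cost is paid for every task in the stage.
big_lookup = {i: str(i) for i in range(100_000)}
small_task = partial(lookup, {})
big_task = partial(lookup, big_lookup)

print(len(pickle.dumps(small_task)), "bytes vs", len(pickle.dumps(big_task)), "bytes")
```

In Spark, the usual fix for this pattern is broadcasting the large object rather than capturing it in the closure.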

Re: [discuss] ending support for Java 6?

2015-05-04 Thread shane knapp
sgtm On Mon, May 4, 2015 at 11:23 AM, Patrick Wendell wrote: > If we just set JAVA_HOME in dev/run-test-jenkins, I think it should work. > > On Mon, May 4, 2015 at 7:20 PM, shane knapp wrote: > > ...and now the workers all have java6 installed. > > > > https://issues.apache.org/jira/browse/SPAR

Re: [discuss] ending support for Java 6?

2015-05-04 Thread Patrick Wendell
If we just set JAVA_HOME in dev/run-test-jenkins, I think it should work. On Mon, May 4, 2015 at 7:20 PM, shane knapp wrote: > ...and now the workers all have java6 installed. > > https://issues.apache.org/jira/browse/SPARK-1437 > > sadly, the built-in jenkins jdk management doesn't allow us to c

Re: [discuss] ending support for Java 6?

2015-05-04 Thread shane knapp
...and now the workers all have java6 installed. https://issues.apache.org/jira/browse/SPARK-1437 sadly, the built-in jenkins jdk management doesn't allow us to choose a JDK version within matrix projects... so we need to manage this stuff manually. On Sun, May 3, 2015 at 8:57 AM, shane knapp

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Paul Brown
It's not JSON, per se, but data formats like smile ( http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide support for markers that can't be confused with content and also provide reasonably similar ergonomics. — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

Re: Update Wiki Developer instructions

2015-05-04 Thread Iulian Dragoș
Ok, here’s how it should be:
- Eclipse Luna
- Scala IDE 4.0
- Scala Test

The easiest way is to download the Scala IDE bundle from the Scala IDE download page. It comes pre-installed with ScalaTest. Alternatively, use the provided upd

Re: LDA and PageRank Using GraphX

2015-05-04 Thread Robin East
There is an LDA example in the MLlib examples. You can run it like this: ./bin/run-example mllib.LDAExample --stopwordFile <stopwords> <input>. Here <stopwords> is a file of stop words, one on each line, and <input> contains the text of each document, one document per line. To see all the options just run with no options or

[ANNOUNCE] Spark branch-1.4

2015-05-04 Thread Patrick Wendell
Hi Devs, Just an announcement that I've cut Spark's branch 1.4 to form the basis of the 1.4 release. Other than a few stragglers, this represents the end of active feature development for Spark 1.4. Per usual, if committers are merging any features, please be in touch so I can help coordinate. Any

Re: Update Wiki Developer instructions

2015-05-04 Thread Sean Owen
I think it's only committers that can edit it. I suppose you can open a JIRA with a suggested text change if it is significant enough to need discussion. If it's trivial, just post it here and someone can take care of it. On Mon, May 4, 2015 at 2:32 PM, Iulian Dragoș wrote: > I'd like to update t

Update Wiki Developer instructions

2015-05-04 Thread Iulian Dragoș
I'd like to update the information about using Eclipse to develop on the Spark project found on this page: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38572224 I don't see any way to edit this page (I created an account). Since it's a wiki, I assumed it's supposed to be edita

Re: Speeding up Spark build during development

2015-05-04 Thread Meethu Mathew
Hi, Is it really necessary to run "mvn --projects assembly/ -DskipTests install"? Could you please explain why this is needed? I got the changes after running "mvn --projects streaming/ -DskipTests package". Regards, Meethu On Monday 04 May 2015 02:20 PM, Em

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
I was wondering if it's possible to use existing Hive SerDes for this? On Mon, May 4, 2015 at 8:36 AM, Joe Halliwell wrote: > I think Reynold’s argument shows the impossibility of the general case. > > But a “maximum object depth” hint could enable a new input format to do > its job both efficie

Re: Speeding up Spark build during development

2015-05-04 Thread Emre Sevinc
Just to give you an example: When I was trying to make a small change only to the Streaming component of Spark, first I built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files only in Streaming, I ran something like (

Re: Speeding up Spark build during development

2015-05-04 Thread Pramod Biligiri
No, I just need to build one project at a time. Right now Spark SQL. Pramod On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc wrote: > Hello Pramod, > > Do you need to build the whole project every time? Generally you don't, > e.g., when I was changing some files that belong only to Spark Streaming,

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Joe Halliwell
I think Reynold’s argument shows the impossibility of the general case. But a “maximum object depth” hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I’d certainly be interested in
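
A rough sketch of how such a hint could work in the common case described above (a top-level array of similarly structured objects): with records known to live at depth 2, boundaries can be found by counting braces rather than fully parsing the stream. Plain Python with naive string handling; an illustration of the idea, not a robust parser:

```python
import json

def split_array_records(text):
    # The "maximum object depth" hint tells us records are the depth-2
    # objects inside a top-level array, so we scan once, tracking nesting
    # depth, and parse each record's span independently. String handling
    # is simplistic (no escaped-quote edge cases).
    records, depth, start, in_str = [], 0, None, False
    for i, ch in enumerate(text):
        if in_str:
            if ch == '"' and text[i - 1] != "\\":
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in "{[":
            depth += 1
            if ch == "{" and depth == 2:
                start = i
        elif ch in "}]":
            if ch == "}" and depth == 2:
                records.append(json.loads(text[start:i + 1]))
            depth -= 1
    return records

doc = '[\n  {"id": 1, "tags": ["x", "y"]},\n  {"id": 2}\n]'
print(split_array_records(doc))  # [{'id': 1, 'tags': ['x', 'y']}, {'id': 2}]
```

Because each record's span is independent once boundaries are known, the per-record parsing could be distributed across tasks.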

Re: Speeding up Spark build during development

2015-05-04 Thread Emre Sevinc
Hello Pramod, Do you need to build the whole project every time? Generally you don't, e.g., when I was changing some files that belong only to Spark Streaming, I was building only the streaming module (of course after having built and installed the whole project, but that was done only once), and then th

Re: Speeding up Spark build during development

2015-05-04 Thread Pramod Biligiri
Using the built-in Maven and Zinc, it takes around 10 minutes for each build. Is that reasonable? My MAVEN_OPTS looks like this: $ echo $MAVEN_OPTS -Xmx12000m -XX:MaxPermSize=2048m I'm running it as build/mvn -DskipTests package Should I be tweaking my Zinc/Nailgun config? Pramod On Sun, May 3, 2