Re: LinearRegressionWithSGD accuracy

2015-01-28 Thread DB Tsai
Hi Robin, You can try this PR out. It has built-in feature scaling and ElasticNet regularization (L1/L2 mix). This implementation can stably converge to the model from R's glmnet package. https://github.com/apache/spark/pull/4259 Sincerely, DB Tsai --
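
For readers unfamiliar with the L1/L2 mix mentioned above, here is a minimal sketch of the elastic net penalty in the parameterization that glmnet documents, plus simple column standardization. It is plain Python for illustration only and is not code from the linked PR; the function names and the `lam` / `alpha` parameter names are assumptions chosen here.

```python
import math


def elastic_net_penalty(weights, lam, alpha):
    """glmnet-style penalty: lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    alpha = 1.0 gives a pure L1 (lasso) penalty, alpha = 0.0 a pure L2 (ridge) one.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def standardize(column):
    """Scale a feature column to zero mean and unit (population) standard deviation.

    Constant columns carry no information, so they are mapped to all zeros.
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0.0:
        return [0.0] * n
    return [(x - mean) / std for x in column]
```

Scaling features before applying the penalty matters because both the L1 and L2 terms depend on the magnitude of each weight, which in turn depends on the units of each feature.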

Re: emergency jenkins restart soon

2015-01-28 Thread shane knapp
np! the master builds haven't triggered yet, but let's give the rube goldberg machine a minute to get its bearings. On Wed, Jan 28, 2015 at 10:31 PM, Reynold Xin wrote: > Thanks for doing that, Shane! > > > On Wed, Jan 28, 2015 at 10:29 PM, shane knapp wrote: > >> jenkins is back up and all b

Re: emergency jenkins restart soon

2015-01-28 Thread Reynold Xin
Thanks for doing that, Shane! On Wed, Jan 28, 2015 at 10:29 PM, shane knapp wrote: > jenkins is back up and all builds have been retriggered... things are > building and looking good, and i'll keep an eye on the spark master builds > tonite and tomorrow. > > On Wed, Jan 28, 2015 at 9:56 PM, sh

Re: emergency jenkins restart soon

2015-01-28 Thread shane knapp
jenkins is back up and all builds have been retriggered... things are building and looking good, and i'll keep an eye on the spark master builds tonite and tomorrow. On Wed, Jan 28, 2015 at 9:56 PM, shane knapp wrote: > the spark master builds stopped triggering ~yesterday and the logs don't >

emergency jenkins restart soon

2015-01-28 Thread shane knapp
the spark master builds stopped triggering ~yesterday and the logs don't show anything. i'm going to give the current batch of spark pull request builder jobs a little more time (~30 mins) to finish, then kill whatever is left and restart jenkins. anything that was queued or killed will be retrig

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan R. Sparks
You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null". See, e.g. http://www.r-bloggers.com/r-na-vs-null/ On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin wrote: > Isn't that just "null" in SQL? > > On Wed, Jan 28, 2015 a

Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Patrick Wendell
It's maintained here: https://github.com/pwendell/akka/tree/2.2.3-shaded-proto Over time, this is something that would be great to get rid of, per rxin On Wed, Jan 28, 2015 at 3:33 PM, Reynold Xin wrote: > Hopefully problems like this will go away entirely in the next couple of > releases. http

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Michael Armbrust
In particular the performance tricks are in SpecificMutableRow. On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan wrote: > Yeah, it's "null". I was worried you couldn't represent it in Row > because of primitive types like Int (unless you box the Int, which > would be a performance hit). Anyways, I'

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
Yeah, it's "null". I was worried you couldn't represent it in Row because of primitive types like Int (unless you box the Int, which would be a performance hit). Anyways, I'll take another look at the Row API again :-p On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin wrote: > Isn't that just "nul
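
One common way to store nulls next to primitive values without boxing each value is a separate null mask alongside a packed value array. The sketch below illustrates only that layout idea; it is not Spark's Row or SpecificMutableRow implementation, and the `NullableIntColumn` class is hypothetical.

```python
from array import array


class NullableIntColumn:
    """Packed C ints plus a per-slot null flag (hypothetical illustration)."""

    def __init__(self, size):
        # Values live in a contiguous typed array, not as individual objects.
        self._values = array("i", [0] * size)
        # One byte per slot: 1 means null. Every slot starts out null.
        self._is_null = bytearray(b"\x01" * size)

    def set(self, index, value):
        if value is None:
            self._is_null[index] = 1
        else:
            self._values[index] = value
            self._is_null[index] = 0

    def get(self, index):
        # The stored value is only meaningful when the null flag is clear.
        return None if self._is_null[index] else self._values[index]
```

With this layout a stored 0 and a missing value remain distinguishable, because nullness is tracked in the mask and not by a sentinel value.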

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
Isn't that just "null" in SQL? On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan wrote: > I believe that most DataFrame implementations out there, like Pandas, > support the idea of missing values / NA, and some support the idea of > Not Meaningful as well. > > Does Row support anything like that? Th

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
I believe that most DataFrame implementations out there, like Pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well. Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable objec

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
It shouldn't change the data source api at all because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously to SchemaRDD). https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala One thing that w

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
Hey guys, How does this impact the data sources API? I was planning on using this for a project. +1 that many things from spark-sql / DataFrame are universally desirable and useful. By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is, at leas

Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Reynold Xin
Hopefully problems like this will go away entirely in the next couple of releases. https://issues.apache.org/jira/browse/SPARK-5293 On Wed, Jan 28, 2015 at 3:12 PM, jay vyas wrote: > Hi spark. Where is akka coming from in spark ? > > I see the distribution referenced is a spark artifact... but

spark akka fork : is the source anywhere?

2015-01-28 Thread jay vyas
Hi spark. Where is akka coming from in spark ? I see the distribution referenced is a spark artifact... but not in the apache namespace. org.spark-project.akka 2.3.4-spark Clearly this is a deliberate thought out change (See SPARK-1812), but it's not clear where 2.3.4 spark is coming fr

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Krishna Sankar
+1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:22 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests 2. Tested pyspark, MLlib - running as well as compare results with 1.1.x & 1.2.0 2.1. statistics (min,max,m

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Reynold Xin
It's an interesting idea, but there are major challenges with per row schema. 1. Performance - query optimizer and execution use assumptions about schema and data to generate optimized query plans. Having to re-reason about schema for each row can substantially slow down the engine, but due to opt

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Cheng Lian
Hi Aniket, In general the schema of all rows in a single table must be same. This is a basic assumption made by Spark SQL. Schema union does make sense, and we're planning to support this for Parquet. But as you've mentioned, it doesn't help if types of different versions of a column differ fr
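
To make the schema union idea concrete, here is a minimal sketch that merges two column-name-to-type mappings and refuses to merge when the same column has different types, which is the case described above as not helped by a union. It is a plain-Python illustration, not Spark SQL or Parquet code; `union_schemas` and the string type names are hypothetical.

```python
def union_schemas(left, right):
    """Merge two {column_name: type_name} schemas.

    Columns present in only one schema are kept; a column present in both
    must have the same type, otherwise the union is rejected.
    """
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise TypeError(
                "column %r has conflicting types: %s vs %s"
                % (name, merged[name], dtype)
            )
        merged[name] = dtype
    return merged
```

For example, merging `{"id": "int"}` with `{"id": "int", "name": "string"}` yields both columns, while merging `{"id": "int"}` with `{"id": "string"}` raises an error.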

Re: Extending Scala style checks

2015-01-28 Thread Nicholas Chammas
FYI: scalastyle just merged in a patch to add support for external rules. I forget why I was following the linked issue, but I assume it's related to this discussion. Nick On Thu Oct 09 2014 at 2:56:30 AM Reynold Xin wrote

Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Dirceu Semighini Filho
Before this I was facing the same problem, and fixed it adding the plugin at the root pom.xml Maybe this is related to the release, mine is: Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T17:58:10-03:00) Java version: 1.8.0_20, vendor: Oracle Corporation OS name: "linux",

Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Sean Owen
I don't see how this would relate to the problem in the OP? the assemblies build fine already as far as I can tell. Your new error may be introduced by your change. On Wed, Jan 28, 2015 at 2:52 PM, Dirceu Semighini Filho wrote: > I was facing the same problem, and I fixed it by adding > > > mav

Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Dirceu Semighini Filho
I was facing the same problem, and I fixed it by adding maven-assembly-plugin 2.4.1 assembly/src/main/assembly/assembly.xml in the root pom.xml, following the maven assembly plugin docs

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Sean Owen
We had both been using Java 8; Ye reports that it fails on Java 6 too. We both believe this has been failing for a fair while, so I do not think it's a regression. I'll make a JIRA though. On Wed, Jan 28, 2015 at 1:22 PM, Ye Xianjin wrote: > Sean, > the MQTTStreamSuite also failed for me on Ma

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Ye Xianjin
Sean, the MQTTStreamSuite also failed for me on Mac OS X, though I don’t have time to investigate that. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, January 28, 2015 at 9:17 PM, Sean Owen wrote: > +1 (nonbinding). I verified that all the hash / signing i

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Sean Owen
+1 (nonbinding). I verified that all the hash / signing items I mentioned before are resolved. The source package compiles on Ubuntu / Java 8. I ran tests and they passed. Well, actually I see the same failure I've been seeing locally on OS X and on Ubuntu for a while, but I think nobody else has seen t

Re: [SQL] Self join with ArrayType columns problems

2015-01-28 Thread PierreB
Should I file a JIRA for this? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Self-join-with-ArrayType-columns-problems-tp10269p10322.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Yes - it fixes that issue. On Wed, Jan 28, 2015 at 2:17 AM, Aniket wrote: > Hi Patrick, > > I am wondering if this version will address issues around certain artifacts > not getting published in 1.2 which are gating people to migrate to 1.2. One > such issue is https://issues.apache.org/jira/brow

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Aniket
Hi Patrick, I am wondering if this version will address issues around certain artifacts not getting published in 1.2 which are gating people to migrate to 1.2. One such issue is https://issues.apache.org/jira/browse/SPARK-5144 Thanks, Aniket On Wed Jan 28 2015 at 15:39:43 Patrick Wendell [via Ap

Data source API | Support for dynamic schema

2015-01-28 Thread Aniket Bhatnagar
I saw the talk on Spark data sources and looking at the interfaces, it seems that the schema needs to be provided upfront. This works for many data sources but I have a situation in which I would need to integrate a system that supports schema evolutions by allowing users to change schema without a

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not v1.2.1-rc1). On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.2.1! > > The tag to be voted on is v1.2.1-rc1 (commit b77f876): > https://git-wip-

[VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1! The tag to be voted on is v1.2.1-rc1 (commit b77f876): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b The release files, including signatures, digests, etc. can

[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-28 Thread Patrick Wendell
This vote is cancelled in favor of RC2. On Tue, Jan 27, 2015 at 4:20 PM, Reynold Xin wrote: > +1 > > Tested on Mac OS X > > On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar > wrote: >> >> +1 >> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min >> mvn clean package -Pyarn -Dyarn.vers