Re: BlockMatrix multiplication

2015-07-15 Thread Burak Yavuz
Hi Alexander, I just noticed the error in my logic. There will always be a shuffle due to the `cogroup`. `join` also uses cogroup, therefore a shuffle is inevitable. However, the reduceByKey will not cause a shuffle. I forgot about how cogroup will try to match things, even if they don't exist. A

RE: BlockMatrix multiplication

2015-07-15 Thread Ulanov, Alexander
Hi Burak, I’ve modified my code as you suggested; however, it still leads to shuffling. Could you suggest what’s wrong with my code, or provide example code for block matrix multiplication that preserves data locality and does not cause shuffling? Modified code: import org.apache.spark.ml
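Since the snippet above is cut off by the archive, here is a minimal, hedged sketch of block matrix multiplication with the mllib distributed-matrix API (the dimensions, block sizes, and data are assumptions, and sc is an existing SparkContext):

    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // Two 4x4 matrices stored as 2x2 grids of 2x2 dense blocks.
    val blocksA = sc.parallelize(for (i <- 0 until 2; j <- 0 until 2)
      yield ((i, j), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))))
    val blocksB = sc.parallelize(for (i <- 0 until 2; j <- 0 until 2)
      yield ((i, j), Matrices.dense(2, 2, Array(2.0, 0.0, 0.0, 2.0))))

    val A = new BlockMatrix(blocksA, 2, 2).cache()
    val B = new BlockMatrix(blocksB, 2, 2).cache()
    val C = A.multiply(B)

    // One way to see whether a shuffle is planned: ShuffledRDDs show up in
    // the lineage debug string (and as extra stages in the web UI).
    println(C.blocks.toDebugString)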

Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.4.1 maintenance release. We recommend all users on the 1.4 branch upgrade to this release, which contains several important bug fixes. Download Spark 1.4.1 - http://spark.apache.org/downloads.html Release notes - http://spark.apache.org/releases/spark-rele

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
I haven't gotten a response on user@ yet for these questions, but these are probably better questions for dev@ anyway, aren't they? Could somebody on dev@ please respond? Thanks, Jonathan From: Jonathan Kelly Date: Wednesday, July 15, 2015 at 12:18 PM To: "u...@spar

Re: Are These Issues Suitable for our Senior Project?

2015-07-15 Thread Joseph Bradley
Per recent comments on SPARK-6442, I'd recommend not working on that one for now. Instead, even if tasks are not that interesting to you, you should try some small tasks at first to get used to contributing. I am quite sure we'll want to solve SPARK-3703 by May 2016; that's pretty far in the futu

Re: Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Bob Beauchemin
Granted the 1=0 thing is ugly and assumes constant-folding support or reads way too much data. Submitted JIRA SPARK-9078 (thanks for pointers) and expounded on possible solutions a little bit more there. Cheers, and thanks, Bob -- View this message in context: http://apache-spark-developers-

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Marcelo Vanzin
Or, alternatively, the bus could catch that error and ignore / log it, instead of stopping the context... On Wed, Jul 15, 2015 at 12:20 PM, Marcelo Vanzin wrote: > Hmm, the Java listener was added in 1.3, so I think it will work for my > needs. > > Might be worth it to make it clear in the Spark

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Marcelo Vanzin
Hmm, the Java listener was added in 1.3, so I think it will work for my needs. Might be worth it to make it clear in the SparkListener documentation that people should avoid using it directly. Or follow Reynold's suggestion. On Wed, Jul 15, 2015 at 12:14 PM, Patrick Wendell wrote: > One related

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Reynold Xin
It's bad that we expose a trait - even though we want to mix in stuff. We should really audit all of these and expose only abstract classes for anything beyond an extremely simple interface. That itself, however, would break binary compatibility. On Wed, Jul 15, 2015 at 12:15 PM, Patrick Wendell wrote

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Patrick Wendell
Actually the Java one is a concrete class. On Wed, Jul 15, 2015 at 12:14 PM, Patrick Wendell wrote: > One related note here is that we have a Java version of this that is > an abstract class - in the doc it says that it exists more or less to > allow for binary compatibility (it says it's for Jav

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Patrick Wendell
One related note here is that we have a Java version of this that is an abstract class - in the doc it says that it exists more or less to allow for binary compatibility (it says it's for Java users, but really Scala could use this also): https://github.com/apache/spark/blob/master/core/src/main/j

Re: problems with build of latest the master

2015-07-15 Thread Sean Owen
Why does Spark need to depend on it? I'm missing that bit. If an OpenStack artifact is needed for OpenStack, shouldn't OpenStack add it? Otherwise everybody gets it in their build. On Wed, Jul 15, 2015 at 7:52 PM, Gil Vernik wrote: > I mean currently users that wish to use Spark and configure Spa

Re: problems with build of latest the master

2015-07-15 Thread Gil Vernik
I mean currently users that wish to use Spark and configure Spark to use OpenStack Swift need to manually edit the pom.xml of Spark (main, core, yarn) and add hadoop-openstack.jar to it and then compile Spark. My question is why not include this dependency in Spark for Hadoop profiles 2.4 and
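For reference, the manual edit described above roughly amounts to adding a dependency like the following to the relevant pom.xml (a sketch only; the version property is assumed to match Spark's hadoop.version):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-openstack</artifactId>
      <version>${hadoop.version}</version>
    </dependency>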

Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Marcelo Vanzin
Hey all, Just noticed this when some of our tests started to fail. SPARK-4072 added a new method to the "SparkListener" trait, and even though it has a default implementation, it doesn't seem like that applies retroactively. Namely, if you have an existing, compiled app that has an implementation
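A minimal sketch of the failure mode being described (the names are illustrative, not the actual SparkListener members): on Scala 2.10/2.11 the concrete forwarders for trait methods are generated into the implementing class at compile time, so a class compiled against the old trait has no body for a method added later, even one with a default implementation.

    // Version 1 of the trait, which the app was compiled against.
    trait Listener {
      def onEventA(): Unit = { }
    }
    class MyListener extends Listener   // compiled once, never recompiled

    // Version 2 adds a method with a default body:
    //   def onEventB(): Unit = { }
    // The old MyListener.class contains no forwarder for onEventB, so calling
    // it on the old binary fails at runtime with AbstractMethodError.

    // With an abstract class the default body lives in the parent's class
    // file, so already-compiled subclasses keep working after the addition:
    abstract class ListenerBase {
      def onEventA(): Unit = { }
      def onEventB(): Unit = { }   // inherited by old subclasses as-is
    }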

Re: Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Reynold Xin
Hi Bob, Thanks for the email. You can select Spark as the project when you file a JIRA ticket at https://issues.apache.org/jira/browse/SPARK For "select 1 from $table where 0=1" -- if the database's optimizer doesn't do constant folding and short-circuit execution, could the query end up scanni

Re: problems with build of latest the master

2015-07-15 Thread Sean Owen
You shouldn't get dependencies you need from Spark, right? You declare direct dependencies. Are we talking about re-scoping or excluding this dep from Hadoop transitively? On Wed, Jul 15, 2015 at 7:33 PM, Gil Vernik wrote: > Right, it's not currently a dependency in Spark. > If we already mention i

Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Bob Beauchemin
tableExists in spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses non-standard SQL (specifically, the LIMIT keyword) to determine whether a table exists in a JDBC data source. This will cause an exception in many/most JDBC databases that don't support the LIMIT keyword. See
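For illustration, a hedged sketch of the WHERE 1=0 style probe discussed later in the thread; this is not the actual JdbcUtils code:

    import java.sql.{Connection, SQLException}

    // True if the probe query can be executed, i.e. the table exists.
    // Engines without LIMIT still accept this form; it relies on the
    // database constant-folding the 1=0 predicate rather than scanning.
    def tableExists(conn: Connection, table: String): Boolean = {
      try {
        val stmt = conn.prepareStatement(s"SELECT 1 FROM $table WHERE 1=0")
        try { stmt.executeQuery(); true } finally { stmt.close() }
      } catch {
        case _: SQLException => false
      }
    }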

Re: problems with build of latest the master

2015-07-15 Thread Gil Vernik
Right, it's not currently a dependency in Spark. If we already mention it, is it possible to make it part of the current dependencies, but only for Hadoop profiles 2.4 and up? This will save a lot of headache for those who use Spark + OpenStack Swift and currently need to manually edit pom.xml every time to add de

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin
Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal. On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling wrote: > I'm considering a few approaches -- one of which is to provide new > functions like mapLeft, mapRight, filterLeft, etc. > > But this all falls shorts with DataFrame

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
I'm considering a few approaches -- one of which is to provide new functions like mapLeft, mapRight, filterLeft, etc. But this all falls short with DataFrames. RDDs can easily be extended from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add special columns? On Wed, Jul 15, 2015

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin
How about just using two fields, one boolean field to mark good/bad, and another to get the source file? On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling wrote: > Hi all, > > I'm working on an ETL task with Spark. As part of this work, I'd like to > mark records with some info such as: > > 1. Whet
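A minimal sketch of that suggestion (the file pattern, parsing rule, and field names are made up for illustration; sc and sqlContext are assumed to exist):

    case class Parsed(good: Boolean, sourceFile: String, line: String)

    // Tag each record with a validity flag and its originating file while parsing.
    val records = sc.wholeTextFiles("data/*.csv").flatMap { case (file, content) =>
      content.split("\n").map { line =>
        Parsed(good = line.split(",").length == 3, sourceFile = file, line = line)
      }
    }

    import sqlContext.implicits._
    val df = records.toDF()
    val bad = df.filter(!df("good"))   // route bad records aside instead of failing the job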

Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
Hi all, I'm working on an ETL task with Spark. As part of this work, I'd like to mark records with some info such as: 1. Whether the record is good or bad (e.g., Either) 2. Originating file and lines. Part of my motivation is to prevent errors with individual records from stopping the entire pipe

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
I think we should start without map-side-combine for Scala, because it's easier to OOM in the JVM than in Python (we don't have a hard limit in Python yet). On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah wrote: > Should we actually enable map-side-combine for groupByKey in Scala RDD as > well, then? If we i

Re: Joining Apache Spark

2015-07-15 Thread Animesh Tripathy
I know Django users would love to use Apache Spark with Python 3 now that the 1.4 release is out. However, I was thinking of making a PHP library for Spark. Does anyone know what could help me get started? I am kinda new to developing libraries; I have only done library design for Amazon MWS.

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Matt Cheah
Should we actually enable map-side-combine for groupByKey in the Scala RDD as well, then? If we implement external-group-by, should we implement it with the map-side-combine semantics that PySpark uses? -Matt Cheah On 7/15/15, 8:21 AM, "Davies Liu" wrote: >If the map-side-combine is not that necessa

Re: problems with build of latest the master

2015-07-15 Thread Ted Yu
If I understand correctly, hadoop-openstack is not currently a dependency in Spark. > On Jul 15, 2015, at 8:21 AM, Josh Rosen wrote: > > We may be able to fix this from the Spark side by adding appropriate > exclusions in our Hadoop dependencies, right? If possible, I think that we > should

Re: problems with build of latest the master

2015-07-15 Thread Josh Rosen
We may be able to fix this from the Spark side by adding appropriate exclusions in our Hadoop dependencies, right? If possible, I think that we should do this. On Wed, Jul 15, 2015 at 7:10 AM, Ted Yu wrote: > I attached a patch for HADOOP-12235 > > BTW openstack was not mentioned in the first e
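If exclusions do turn out to be the right fix, the change would presumably look something like the sketch below; the excluded artifacts are assumptions based on the HADOOP-12235 discussion, not something confirmed here:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-openstack</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.mockito</groupId>
          <artifactId>mockito-all</artifactId>
        </exclusion>
        <exclusion>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
        </exclusion>
      </exclusions>
    </dependency>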

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
If the map-side-combine is not that necessary, given the fact that it cannot reduce the size of the data for shuffling much (we do need to serialize the key for each value), but it can reduce the number of key-value pairs, and potentially reduce the number of operations later (repartition and groupby). On Tu
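For context, a rough sketch of grouping without map-side combine using the plain Scala RDD API (the sample data is made up; this mirrors the idea under discussion rather than reproducing the actual groupByKey implementation):

    import org.apache.spark.HashPartitioner
    import scala.collection.mutable.ArrayBuffer

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // With mapSideCombine = false the shuffle moves individual (key, value)
    // pairs; values are only gathered into per-key buffers on the reduce
    // side, so no large buffers build up before the shuffle.
    val grouped = pairs.combineByKey[ArrayBuffer[Int]](
      (v: Int) => ArrayBuffer(v),
      (buf: ArrayBuffer[Int], v: Int) => buf += v,
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2,
      new HashPartitioner(pairs.partitions.length),
      mapSideCombine = false)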

Re: problems with build of latest the master

2015-07-15 Thread Ted Yu
I attached a patch for HADOOP-12235 BTW openstack was not mentioned in the first email from Gil. My email and Gil's second email were sent around the same moment. Cheers On Wed, Jul 15, 2015 at 2:06 AM, Steve Loughran wrote: > > On 14 Jul 2015, at 12:22, Ted Yu wrote: > > Looking at Jenkins

Re: RestSubmissionClient Basic Auth

2015-07-15 Thread Joel Zambrano
Thanks Akhil! For the one where I change the REST client, how likely would it be that a change like that goes through? Would it be rejected as an uncommon scenario? I really don't want to have this as a separate fork of the branch. Thanks, Joel From: Akhil Das Sent

Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2015-07-15 Thread daniel.mescheder
Hey everyone, Consider the following use of spark.sql.shuffle.partitions: case class Data(A:String = f"${(math.random*1e8).toLong}%09.0f", B: String = f"${(math.random*1e8).toLong}%09.0f") val dataFrame = (1 to 1000).map(_ => Data()).toDF dataFrame.registerTempTable("data") sqlContext.setConf( "
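The snippet above is flattened and cut off by the archive preview; a hedged reconstruction follows (the setConf value is an assumption; the key is taken from the subject line):

    import sqlContext.implicits._

    case class Data(A: String = f"${(math.random*1e8).toLong}%09.0f",
                    B: String = f"${(math.random*1e8).toLong}%09.0f")

    val dataFrame = (1 to 1000).map(_ => Data()).toDF
    dataFrame.registerTempTable("data")
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")   // assumed value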

Re: RestSubmissionClient Basic Auth

2015-07-15 Thread Akhil Das
Either way is fine. A relay proxy would be much easier; adding authentication to the REST client would require you to rebuild and test the piece of code that you wrote for authentication. Thanks Best Regards On Wed, Jul 15, 2015 at 4:51 AM, Joel Zambrano wrote: > Hi! We have a gateway with basic

Re: problems with build of latest the master

2015-07-15 Thread Steve Loughran
On 14 Jul 2015, at 12:22, Ted Yu wrote: Looking at Jenkins, the master branch compiles. Can you try the following command? mvn -Phive -Phadoop-2.6 -DskipTests clean package What version of Java are you using? Ted, Gil has stuck in hadoop-openstack, it's that whic

Re: Should spark-ec2 get its own repo?

2015-07-15 Thread Sean Owen
The code can continue to be a good reference implementation, no matter where it lives. In fact, it can be a better, more complete one, and easier to update. I agree that ec2/ needs to retain some kind of pointer to the new location. Yes, maybe a script as well that does the checkout as you say. We

Expression.resolved unmatched with the correct values in catalyst?

2015-07-15 Thread Takeshi Yamamuro
Hi devs, I found that the case of 'Expression.resolved != (Expression.childrenResolved && checkInputDataTypes().isSuccess)' occurs in the output of the Analyzer. That is, some tests in o.a.s.sql.* fail if the code below is added in CheckAnalysis: https://github.com/maropu/spark/commit/a488eee8351f5
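A hedged sketch of the kind of consistency check being described (the identifiers are the ones quoted above; the traversal shape is an assumption, not the actual patch, and plan stands for the analyzed LogicalPlan):

    // Walk every expression in the analyzed plan and flag any whose cached
    // `resolved` flag disagrees with the definition it is supposed to track.
    plan.foreach { operator =>
      operator.expressions.foreach { rootExpr =>
        rootExpr.foreach { e =>
          val expected = e.childrenResolved && e.checkInputDataTypes().isSuccess
          if (e.resolved != expected) {
            sys.error(s"resolved mismatch in $e: " +
              s"resolved=${e.resolved}, expected=$expected")
          }
        }
      }
    }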