Re: problem with using mapPartitions

2015-05-30 Thread unioah
Thank you for your reply. But the typo is not the reason for the problem. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/problem-with-using-mapPartitions-tp12514p12520.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Reynold Xin
Name resolution is not that easy, I think. Wenchen can maybe give you some advice on this one. On Sat, May 30, 2015 at 9:37 AM, Yijie Shen henry.yijies...@gmail.com wrote: I think just match the Column’s expr as UnresolvedAttribute and use UnresolvedAttribute’s name to match
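The name-matching idea from the thread can be sketched in plain Python. This is not PySpark's internals, just the shape of the approach: treat a Column whose expression is an unresolved attribute as a bare name, and match that name against the DataFrame's schema when dropping (all class and function names here are made up for illustration).

```python
class UnresolvedAttribute:
    """Stand-in for Catalyst's unresolved attribute: just a column name."""
    def __init__(self, name):
        self.name = name

class Column:
    """Stand-in for a DataFrame Column wrapping an expression."""
    def __init__(self, expr):
        self.expr = expr

def drop_column(schema, col):
    """Return the schema minus the column, matching by unresolved name."""
    if isinstance(col.expr, UnresolvedAttribute):
        return [f for f in schema if f != col.expr.name]
    # resolved or complex expressions would need real catalog resolution
    return schema

schema = ["id", "x", "y"]
print(drop_column(schema, Column(UnresolvedAttribute("x"))))  # ['id', 'y']
```

The open question in the thread is exactly the part this sketch sidesteps: matching a *qualified* name against the catalog rather than a bare string.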

Re: Using UDFs in Java without registration

2015-05-30 Thread Reynold Xin
We added all the TypeTags for arguments but haven't gotten around to using them yet. I think it'd make sense to have them and do the auto cast, but we can have rules in analysis to forbid certain casts (e.g. don't auto cast double to int). On Sat, May 30, 2015 at 7:12 AM, Justin Uang

Re: problem with using mapPartitions

2015-05-30 Thread Ted Yu
bq. val result = fDB.mappartitions(testMP).collect Not sure if you pasted the above code - there was a typo: the method name should be mapPartitions. Cheers On Sat, May 30, 2015 at 9:44 AM, unioah uni...@gmail.com wrote: Hi, I try to aggregate the value in each partition internally. For
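The per-partition aggregation the original poster describes can be simulated in plain Python to show what mapPartitions does. This is not Spark code; partitions are modeled as lists, and the function names (`map_partitions`, `test_mp`) are invented for the sketch.

```python
def map_partitions(partitions, f):
    """Mimic RDD.mapPartitions: call f once per partition with an iterator
    over that partition's elements, collecting what f yields."""
    return [list(f(iter(p))) for p in partitions]

def test_mp(it):
    # aggregate inside a single partition: emit one partial sum per partition
    yield sum(it)

partitions = [[1, 2, 3], [4, 5], [6]]
print(map_partitions(partitions, test_mp))  # [[6], [9], [6]]
```

The key contract, same as in Spark, is that `f` receives an iterator over one partition and returns an iterator of results, so per-partition state (like a running sum) stays local.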

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Reynold Xin
I think you are looking for http://en.wikipedia.org/wiki/Common_subexpression_elimination in the optimizer. One thing to note is that as we do more and more optimization like this, the optimization time might increase. Do you see a case where this can bring you substantial performance gains? On
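Common subexpression elimination can be illustrated with a toy evaluator, unrelated to Catalyst's actual implementation: expressions are tuples, and structurally identical subtrees are computed once via a cache, so `x * 7` inside `(x * 7) * 3` is not recomputed.

```python
def evaluate(expr, env, cache):
    """Evaluate a toy expression: a literal, a column name, or (op, l, r).
    Structurally equal subtrees hit the cache instead of being recomputed."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    if expr in cache:              # common subexpression: reuse prior result
        return cache[expr]
    op, left, right = expr
    lv = evaluate(left, env, cache)
    rv = evaluate(right, env, cache)
    result = lv * rv if op == "*" else lv + rv
    cache[expr] = result
    return result

y_expr = ("*", "x", 7)        # y = x * 7
z_expr = ("*", y_expr, 3)     # z = (x * 7) * 3, sharing the subtree for y
cache = {}
projection = [evaluate(y_expr, {"x": 2}, cache),
              evaluate(z_expr, {"x": 2}, cache)]
print(projection)             # [14, 42]
print(y_expr in cache)        # True: x * 7 was computed once and reused
```

This also shows the trade-off Reynold mentions: the optimizer pays a cost (here, hashing every subtree) to detect the sharing.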

Re: problem with using mapPartitions

2015-05-30 Thread unioah
I solved the problem. It was caused by using the spark-core_2.11 Maven artifact. When I compiled against spark-core_2.10, the problem didn't show up again.

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-30 Thread Krishna Sankar
+1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests 2. Tested pyspark, MLlib - running as well as comparing results with 1.3.1 2.1. statistics

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Michael Armbrust
I think this is likely something that we'll want to do during the code generation phase, though it's probably not the lowest-hanging fruit at this point. On Sun, May 31, 2015 at 5:02 AM, Reynold Xin r...@databricks.com wrote: I think you are looking for

Re: Using UDFs in Java without registration

2015-05-30 Thread Reynold Xin
I think you are right that there is no way to call Java UDF without registration right now. Adding another 20 methods to functions would be scary. Maybe the best way is to have a companion object for UserDefinedFunction, and define UDF there? e.g. object UserDefinedFunction { def define(f:
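Reynold's suggestion is Scala-specific (a companion object for UserDefinedFunction), but the design idea translates: one `define` factory that wraps any callable, instead of adding one overload per arity to `functions`. A rough Python analogue, with hypothetical names that are not Spark's API:

```python
class UserDefinedFunction:
    """Minimal stand-in: a wrapped callable plus its declared return type."""
    def __init__(self, f, return_type):
        self.f = f
        self.return_type = return_type

    def __call__(self, *args):
        return self.f(*args)

def define(f, return_type):
    """Single factory covering every arity, rather than ~20 overloads."""
    return UserDefinedFunction(f, return_type)

times_two = define(lambda x: x * 2, int)
print(times_two(21))            # 42
print(times_two.return_type)    # <class 'int'>
```

The appeal is exactly what the email says: callers get one discoverable entry point instead of a wall of arity-specific methods.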

Re: StreamingContextSuite fails with NoSuchMethodError

2015-05-30 Thread Tathagata Das
Was it a clean compilation? TD On Fri, May 29, 2015 at 10:48 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, I ran the following command on 1.4.0 RC3: mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive package I saw the following failure: StreamingContextSuite: - from

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Reynold Xin
Yeah, it would be great to support a Column. Can you create a JIRA, and possibly a pull request? On Fri, May 29, 2015 at 2:45 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Actually, the Scala API too is only based on column name Le ven. 29 mai 2015 à 11:23, Olivier Girardot

Re: StreamingContextSuite fails with NoSuchMethodError

2015-05-30 Thread Ted Yu
I downloaded the source tarball and ran a command similar to the following: clean package -DskipTests Then I ran the following command. FYI On May 30, 2015, at 12:42 AM, Tathagata Das t...@databricks.com wrote: Was it a clean compilation? TD On Fri, May 29, 2015 at 10:48 PM, Ted

Sidebar: issues targeted for 1.4.0

2015-05-30 Thread Sean Owen
No 1.4.0 Blockers at this point, which is great. Forking this thread to discuss something else. There are 92 issues targeted for 1.4.0, 28 of which are marked Critical. Many are procedural issues like "update docs for 1.4" or "check X for 1.4". Are these resolved? They sound like things that are

Re: Using UDFs in Java without registration

2015-05-30 Thread Justin Uang
The idea of asking for both the argument and return class is interesting. I don't think we do that for the Scala APIs currently, right? In functions.scala, we only use the TypeTag for RT. def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = { UserDefinedFunction(f,
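In Scala, the TypeTag context bounds let `udf` capture types at the call site. A loose Python analogue of "asking for both the argument and return class" uses function annotations: inspect the type hints and record argument types alongside the return type (a sketch, not PySpark's mechanism).

```python
import typing

def udf(f):
    """Capture both argument types and the return type from annotations,
    roughly analogous to taking TypeTags for A1 and RT in Scala."""
    hints = typing.get_type_hints(f)
    return_type = hints.pop("return", None)
    arg_types = list(hints.values())   # annotation order is preserved
    return {"f": f, "return_type": return_type, "arg_types": arg_types}

def to_upper(s: str) -> str:
    return s.upper()

wrapped = udf(to_upper)
print(wrapped["return_type"], wrapped["arg_types"])
```

Having the argument types available is what would enable the auto-cast (or cast-forbidding analysis rules) discussed earlier in the thread.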

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Olivier Girardot
JIRA done: https://issues.apache.org/jira/browse/SPARK-7969 I've already started working on it, but it's less trivial than it seems because I don't exactly know the inner workings of the catalog, or how to get the qualified name of a column to match it against the schema/catalog. Regards,

Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
If I do the following df2 = df.withColumn('y', df['x'] * 7) df3 = df2.withColumn('z', df2.y * 3) df3.explain() Then the result is Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65] PhysicalRDD

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
On second thought, perhaps this can be done by writing a rule that builds the DAG of dependencies between expressions, then converts it into several layers of projections, where each new layer is allowed to depend on expression results from previous projections. Are there any pitfalls to this
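The layering idea above can be sketched as a small topological-layering routine, independent of Catalyst: given which new expressions depend on which other new expressions, assign each to the earliest projection layer whose inputs are all produced by previous layers (the representation here is invented for the sketch).

```python
def layer_projections(exprs):
    """exprs: dict mapping expression name -> set of names of OTHER new
    expressions it depends on (base-column inputs are always available).
    Returns a list of layers; each layer only uses earlier layers' results."""
    layers, placed = [], {}
    remaining = dict(exprs)
    while remaining:
        ready = [n for n, deps in remaining.items()
                 if all(d in placed for d in deps)]
        if not ready:
            raise ValueError("cyclic dependency between expressions")
        for n in ready:
            placed[n] = len(layers)
            del remaining[n]
        layers.append(sorted(ready))
    return layers

# y = x * 7 depends only on base column x; z = y * 3 depends on y,
# so they land in two successive projection layers.
print(layer_projections({"y": set(), "z": {"y"}}))  # [['y'], ['z']]
```

One pitfall the sketch makes visible: each extra layer is an extra projection over the previous one, so deep dependency chains trade recomputation for more operators.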