Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread René Treffer
Hi, I'm using sqlContext.jdbc(uri, table, where).map(_ => 1).aggregate(0)(_+_,_+_) on an interactive shell (where "where" is an Array[String] of 32 to 48 elements). (The code is tailored to our db, specifically through the where conditions; I'd have otherwise posted it.) That should be the DataFrame
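A minimal sketch of the pattern described above, for readers unfamiliar with the predicate-array overload of sqlContext.jdbc in Spark 1.4. The URI, table name, and predicates here are illustrative placeholders, not the actual values from the thread:

```scala
// Count the rows of a JDBC-backed table using partition predicates.
// Each element of the array becomes one partition: Spark issues one
// SELECT per predicate, so 32-48 elements gives 32-48 parallel reads.
val uri = "jdbc:mysql://dbhost:3306/mydb?user=spark&password=secret"
val predicates = Array(
  "id % 4 = 0", "id % 4 = 1", "id % 4 = 2", "id % 4 = 3"
)
val df = sqlContext.jdbc(uri, "some_table", predicates)

// Mapping every row to 1 and summing forces SparkSQL to read and
// convert every column, which surfaces type-conversion bugs.
val rowCount = df.map(_ => 1).aggregate(0)(_ + _, _ + _)
```

This is just a row count, but unlike df.count() it guarantees every column value is materialized, which is the point of the exercise in this thread.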

Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread Reynold Xin
Never mind my comment about 3. You were talking about the read side, while I was thinking about the write side. Your workaround actually is a pretty good idea. Can you create a JIRA for that as well? On Monday, June 1, 2015, Reynold Xin r...@databricks.com wrote: René, Thanks for sharing your

spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread René Treffer
Hi *, I used to run into a few problems with the jdbc/mysql integration and thought it would be nice to load our whole db, doing nothing but .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames. SparkSQL has to load all columns and process them, so this should reveal type errors like SPARK-7897

Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-01 Thread Steve Loughran
Is this backported to branch 1.3? On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote: FYI we merged a patch that improves unit test log debugging. In order for that to work, all test suites have been changed to extend SparkFunSuite instead of ScalaTest's
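For context, the change discussed here means Spark test suites extend SparkFunSuite (which carries the improved log-debugging support) instead of ScalaTest's FunSuite directly. A sketch, with the suite and test names hypothetical:

```scala
// Before: class MyFeatureSuite extends org.scalatest.FunSuite
// After:
import org.apache.spark.SparkFunSuite

class MyFeatureSuite extends SparkFunSuite {
  // test(...) and assertions work exactly as in plain FunSuite;
  // SparkFunSuite itself extends FunSuite and adds per-suite logging.
  test("example") {
    assert(1 + 1 === 2)
  }
}
```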

Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread Reynold Xin
René, Thanks for sharing your experience. Are you using the DataFrame API or SQL? (1) Any recommendations on what we do w.r.t. out of range values? Should we silently turn them into a null? Maybe based on an option? (2) Looks like a good idea to always quote column names. The small tricky thing

GraphX: New graph operator

2015-06-01 Thread Tarek Auel
Hello, Someone proposed in a JIRA issue to implement new graph operations. Sean Owen recommended checking first with the mailing list whether this is interesting or not. So I would like to know whether it is interesting for GraphX to implement operators like:

Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-01 Thread Reynold Xin
I don't think so. On Monday, June 1, 2015, Steve Loughran ste...@hortonworks.com wrote: Is this backported to branch 1.3? On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote: FYI we merged a patch that improves unit test

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Peter Rudenko
Still have a problem using HiveContext from sbt. Here’s an example of dependencies: |val sparkVersion = "1.4.0-rc3" lazy val root = Project(id = "spark-hive", base = file("."), settings = Project.defaultSettings ++ Seq( name := "spark-1.4-hive", scalaVersion := "2.10.5", scalaBinaryVersion := "2.10",

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Yin Huai
Hi Peter, Based on your error message, it seems you were not using RC3. For the error thrown at HiveContext's line 206, we changed the message to this one https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just before

Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-01 Thread Andrew Or
It will be within the next few days 2015-06-01 9:17 GMT-07:00 Reynold Xin r...@databricks.com: I don't think so. On Monday, June 1, 2015, Steve Loughran ste...@hortonworks.com wrote: Is this backported to branch 1.3? On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote:

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Andrew Or
+1 (binding) Tested the standalone cluster mode REST submission gateway - submit / status / kill Tested simple applications on YARN client / cluster modes with and without --jars Tested python applications on YARN client / cluster modes with and without --py-files* Tested dynamic allocation on

Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread Reynold Xin
Thanks, René. I actually added a warning to the new JDBC reader/writer interface for 1.4.0. Even with that, I think we should support throttling JDBC; otherwise it's too convenient for our users to DOS their production database servers! /** * Construct a [[DataFrame]] representing the

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Peter Rudenko
Thanks Yin, tried on a clean VM - works now. But tests in my app still fail: |[info] Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Michael Armbrust
It's no longer valid to start more than one instance of HiveContext in a single JVM, as one of the goals of this refactoring was to allow connection to more than one metastore from a single context. For tests I suggest you use TestHive as we do in our unit tests. It has a reset() method you can
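A sketch of the TestHive pattern suggested here; the suite and test names are illustrative:

```scala
// TestHive is a singleton HiveContext backed by a local test metastore,
// shared across suites instead of creating a fresh HiveContext per test
// (which is no longer valid within one JVM as of this refactoring).
import org.apache.spark.sql.hive.test.TestHive

class MyHiveSuite extends org.scalatest.FunSuite {
  test("query against the test metastore") {
    val df = TestHive.sql("SELECT 1")
    assert(df.collect().head.getInt(0) === 1)
    // Restore the metastore to a clean state so later tests start fresh.
    TestHive.reset()
  }
}
```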

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Sean Owen
I get a bunch of failures in VersionSuite with build/test params -Pyarn -Phive -Phadoop-2.6: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar]

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Bobby Chowdary
Hive Context works on RC3 for MapR after adding spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819 https://issues.apache.org/jira/browse/SPARK-7819. However, there still seem to be some other issues with native libraries; I get the warning below: WARN NativeCodeLoader: Unable to load

[SQL] Write parquet files under partition directories?

2015-06-01 Thread Matt Cheah
Hi there, I noticed in the latest Spark SQL programming guide https://spark.apache.org/docs/latest/sql-programming-guide.html , there is support for optimized reading of partitioned Parquet files that have a particular directory structure (year=1/month=10/day=3, for example). However, I see no

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Patrick Wendell
Hey Bobby, Those are generic warnings that the hadoop libraries throw. If you are using MapRFS they shouldn't matter since you are using the MapR client and not the default hadoop client. Do you have any issues with functionality... or was it just seeing the warnings that was the concern?

Re: [SQL] Write parquet files under partition directories?

2015-06-01 Thread Reynold Xin
There will be in 1.4. df.write.partitionBy("year", "month", "day").parquet("/path/to/output") On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote: Hi there, I noticed in the latest Spark SQL programming guide https://spark.apache.org/docs/latest/sql-programming-guide.html, there
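Spelled out, the 1.4 write path and the corresponding partition-discovering read; the output path and column names are illustrative:

```scala
// Writes one directory per partition value, e.g.:
//   /path/to/output/year=2015/month=6/day=1/part-*.parquet
df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

// Reading the root directory discovers the partition columns from the
// year=/month=/day= directory names and adds them back to the schema,
// so partition-column filters prune directories instead of scanning files.
val readBack = sqlContext.read.parquet("/path/to/output")
readBack.filter("year = 2015 AND month = 6").show()
```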

Re: [Streaming] Configure executor logging on Mesos

2015-06-01 Thread Gerard Maas
Hi Tim, (added dev, removed user) I've created https://issues.apache.org/jira/browse/SPARK-8009 to track this. -kr, Gerard. On Sat, May 30, 2015 at 7:10 PM, Tim Chen t...@mesosphere.io wrote: So sounds like some generic downloadable uris support can solve this problem, that Mesos

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Bobby Chowdary
Hi Patrick, Thanks for clarifying. No issues with functionality. +1 (non-binding) Thanks Bobby On Mon, Jun 1, 2015 at 9:41 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Bobby, Those are generic warnings that the hadoop libraries throw. If you are using MapRFS they