Re: test cases stuck on local-cluster mode of ReplSuite?
Sorry to revive an old thread, but I just ran into this issue myself. It is likely that you do not have the assembly jar built, or that you have SPARK_HOME set incorrectly (it does not need to be set). Michael

On Thu, Feb 27, 2014 at 8:13 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all. Actually this problem has existed for months on my side: when I run the test cases, they stop (actually pause?) at the ReplSuite.

[info] ReplSuite: 2014-02-27 10:57:37.220 java[3911:1303] Unable to load realm info from SCDynamicStore
[info] - propagation of local properties (7 seconds, 646 milliseconds)
[info] - simple foreach with accumulator (6 seconds, 204 milliseconds)
[info] - external vars (4 seconds, 271 milliseconds)
[info] - external classes (3 seconds, 186 milliseconds)
[info] - external functions (4 seconds, 843 milliseconds)
[info] - external functions that access vars (3 seconds, 503 milliseconds)
[info] - broadcast vars (4 seconds, 313 milliseconds)
[info] - interacting with files (2 seconds, 492 milliseconds)

The next test case should be:

test("local-cluster mode") {
  val output = runInterpreter("local-cluster[1,1,512]",
    """
      |var v = 7
      |def getV() = v
      |sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
      |v = 10
      |sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
      |var array = new Array[Int](5)
      |val broadcastArray = sc.broadcast(array)
      |sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
      |array(0) = 5
      |sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
    """.stripMargin)
  assertDoesNotContain("error:", output)
  assertDoesNotContain("Exception", output)
  assertContains("res0: Int = 70", output)
  assertContains("res1: Int = 100", output)
  assertContains("res2: Array[Int] = Array(0, 0, 0, 0, 0)", output)
  assertContains("res4: Array[Int] = Array(0, 0, 0, 0, 0)", output)
}

I don't see any reason for it to spend so much time here. Any ideas? I'm using an MBP, OS X 10.9.1, Intel Core i7 2.9 GHz, 8GB 1600 MHz DDR3 memory. Best, -- Nan Zhu
Re: new Catalyst/SQL component merged into master
Hi Everyone, I'm very excited about merging this new feature into Spark! We have a lot of cool things in the pipeline, including: porting Shark's in-memory columnar format to Spark SQL, code generation for expression evaluation, and improved support for complex types in Parquet. I would love to hear feedback on the interfaces and what is missing. In particular, while we have pretty good test coverage for Hive, there has not been a lot of testing with real Hive deployments and there is certainly a lot more work to do. So please test it out, and if there are any missing features let me know! Michael

On Thu, Mar 20, 2014 at 6:11 PM, Reynold Xin r...@databricks.com wrote: Hi All, I'm excited to announce a new module in Spark (SPARK-1251). After an initial review we've merged this into Spark as an alpha component to be included in Spark 1.0. This new component adds some exciting features, including: - schema-aware RDD programming via an experimental DSL - native Parquet support - support for executing SQL against RDDs The pull request itself contains more information: https://github.com/apache/spark/pull/146 You can also find the documentation for this new component here: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html This contribution was led by Michael Armbrust with work from several other contributors who I'd like to highlight here: Yin Huai, Cheng Lian, Andre Schumacher, Timothy Chen, Henry Cook, and Mark Hamstra. - Reynold
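For readers who want a feel for the new component, here is a minimal sketch of the kind of usage described in the linked programming guide. The record class, file path, and table name are made up, and method names follow the pre-1.0 guide, so details may differ in the released API:

import org.apache.spark.SparkContext

// Hypothetical record type; any case class serves as the schema description.
case class Person(name: String, age: Int)

val sc = new SparkContext("local", "sql-sketch")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings in the implicit conversion to a schema-aware RDD

// Build an ordinary RDD of case classes from a text file (path is made up)...
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// ...register it as a table, and run SQL directly against it.
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)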
Re: new Catalyst/SQL component merged into master
It would be great if there are any examples or use cases to look at? There are examples in the Spark documentation. Patrick posted an updated copy here so people can see them before 1.0 is released: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html Does this feature have different use cases than Shark, or is it cleaner now that the Hive dependency is gone? Depending on how you use this, there is still a dependency on Hive (by default this is not the case; see the above documentation for more details). However, the dependency is on a stock version of Hive instead of one modified by the AMPLab. Furthermore, Spark SQL has its own optimizer, instead of relying on the Hive optimizer. Long term, this is going to give us a lot more flexibility to optimize queries specifically for the Spark execution engine. We are actively porting over the best parts of Shark (specifically the in-memory columnar representation). Shark still has some features that are missing in Spark SQL, including SharkServer (and years of testing). Once Spark SQL graduates from alpha status, it'll likely become the new backend for Shark.
Making RDDs Covariant
Hey Everyone, Here is a pretty major (but source-compatible) change we are considering making to the RDD API for 1.0. The Java and Python APIs would remain the same, but users of the Scala API would likely need fewer casts. This would be especially true for libraries whose functions take RDDs as parameters. Any comments would be appreciated! https://spark-project.atlassian.net/browse/SPARK-1296 Michael
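To make the proposal concrete, here is a tiny, hypothetical illustration; the Animal/Cat classes and the library function are invented, and RDD is invariant today, so the covariant behavior is what SPARK-1296 would add:

import org.apache.spark.rdd.RDD

class Animal
class Cat extends Animal

// A library function written against the general element type.
def countAnimals(animals: RDD[Animal]): Long = animals.count()

// Today, with an invariant RDD[T], callers holding an RDD[Cat] need a cast:
//   countAnimals(cats.asInstanceOf[RDD[Animal]])
// With a covariant RDD[+T], the call would compile directly:
//   countAnimals(cats)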
Re: Making RDDs Covariant
From my experience, covariance often becomes a pain when dealing with serialization/deserialization (I've experienced a few cases while developing play-json and Datomisca). Moreover, if you have implicits, variance often becomes a headache... This is exactly the kind of feedback I was hoping for! Can you be any more specific about the kinds of problems you ran into here?
Re: Making RDDs Covariant
Hi Pascal, Thanks for the input. I think we are going to be okay here since, as Koert said, the current serializers use runtime type information. We could also keep a ClassTag around for the original type when the RDD was created. Good things to be aware of though. Michael

On Sat, Mar 22, 2014 at 12:42 PM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: On Sat, Mar 22, 2014 at 8:38 PM, David Hall d...@cs.berkeley.edu wrote: On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: The problem I was talking about is when you try to use typeclass converters and make them contravariant/covariant for input/output. Something like:

trait Reader[-I, +O] { def read(i: I): O }

Doing this, you soon have implicit collisions and philosophical concerns about what it means to serialize/deserialize a Parent class and a Child class... You should (almost) never make a typeclass param contravariant. It's almost certainly not what you want: https://issues.scala-lang.org/browse/SI-2509 -- David I confirm that it's a pain and I must say I never do it, but I've inherited historical code that did it :)
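For readers unfamiliar with the pitfall David is pointing at, here is a small, self-contained sketch (not Spark code; the Writer typeclass and its instances are invented) of how a contravariant typeclass parameter interacts badly with implicit resolution under SI-2509:

// Standalone illustration of the contravariance pitfall; names are made up.
trait Writer[-A] { def write(a: A): String }

object Writers {
  implicit val anyWriter: Writer[Any] = new Writer[Any] {
    def write(a: Any): String = "any:" + a.toString
  }
  implicit val intWriter: Writer[Int] = new Writer[Int] {
    def write(i: Int): String = "int:" + i
  }
}

def show[A](a: A)(implicit w: Writer[A]): String = w.write(a)

// Because Writer is contravariant, Writer[Any] is a subtype of Writer[Int],
// so with both instances in scope implicit search treats the Any instance as
// at least as specific as the Int one, and show(1) ends up using (or colliding
// with) the overly general writer -- rarely what the author intended.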
Re: new Catalyst/SQL component merged into master
Hi Evan, Index support is definitely something we would like to add, and it is possible that adding support for your custom indexing solution would not be too difficult. We already push predicates into Hive table scan operators when the predicates are over partition keys. You can see an example of how we collect filters and decide which can be pushed into the scan in the HiveTableScan query planning strategy: https://github.com/marmbrus/spark/blob/0ae86cfcba56b700d8e7bd869379f0c663b21c1e/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L56 I'd like to know more about your indexing solution. Is this something publicly available? One concern here is that the query planning code is not considered a public API and so is likely to change quite a bit as we improve the optimizer. It's not currently something that we plan to expose for external components to modify. Michael

On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan e...@ooyala.com wrote: Hi Michael, Congrats, this is really neat! What thoughts do you have regarding adding indexing support and predicate pushdown to this SQL framework? Right now we have custom bitmap indexing to speed up queries, so we're really curious as far as the architectural direction. -Evan

On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust mich...@databricks.com wrote: It would be great if there are any examples or use cases to look at? There are examples in the Spark documentation. Patrick posted an updated copy here so people can see them before 1.0 is released: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html Does this feature have different use cases than Shark, or is it cleaner now that the Hive dependency is gone? Depending on how you use this, there is still a dependency on Hive (by default this is not the case; see the above documentation for more details). However, the dependency is on a stock version of Hive instead of one modified by the AMPLab. Furthermore, Spark SQL has its own optimizer, instead of relying on the Hive optimizer. Long term, this is going to give us a lot more flexibility to optimize queries specifically for the Spark execution engine. We are actively porting over the best parts of Shark (specifically the in-memory columnar representation). Shark still has some features that are missing in Spark SQL, including SharkServer (and years of testing). Once Spark SQL graduates from alpha status, it'll likely become the new backend for Shark. -- -- Evan Chan Staff Engineer e...@ooyala.com |
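To make the partition-pruning idea concrete, here is a simplified, hypothetical sketch of the kind of split the linked strategy performs; the real code in HiveStrategies.scala differs in detail, and these Attribute/Expression stand-ins are invented for the example:

// Stand-in types for the sketch; in Catalyst these are real classes with
// much richer structure.
case class Attribute(name: String)
case class Expression(references: Set[Attribute])

// Predicates that mention only partition keys can be evaluated while pruning
// partitions inside the table scan; everything else stays in a Filter above it.
def splitPredicates(
    filters: Seq[Expression],
    partitionKeys: Set[Attribute]): (Seq[Expression], Seq[Expression]) =
  filters.partition(_.references.subsetOf(partitionKeys))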
Travis CI
Just a quick note to everyone that Patrick and I are playing around with Travis CI on the Spark GitHub repository. For now, Travis does not run all of the test cases, so it will only be turned on experimentally. Long term it looks like Travis might give better integration with GitHub, so we are going to see if it is feasible to get all of our tests running on it. *Jenkins remains the reference CI and should be consulted before merging pull requests, independent of what Travis says.* If you have any questions or want to help out with the investigation, let me know! Michael
Re: Travis CI
Is the migration from Jenkins to Travis finished? It is not finished, and really at this point it is only something we are considering, not something that will happen for sure. We turned it on in addition to Jenkins so that we could start finding issues exactly like the ones you described below, to determine if Travis is going to be a viable option. Basically it seems to me that the Travis environment is a little less predictable (probably because of virtualization), and this is pointing out some existing flakiness in the tests. If there are tests that are regularly flaky, we should probably file JIRAs so they can be fixed or switched off. If you have seen a test fail 2-3 times and then pass with no changes, I'd say go ahead and file an issue for it (others should feel free to chime in if we want some other process here). A few more specific comments inline below. 2. hive/test usually aborted because it doesn't output anything within 10 minutes Hmm, this is a little confusing. Do you have a pointer to this one? Was there any other error? 4. hive/test didn't finish in 50 minutes, and was aborted Here I think the right thing to do is probably break the Hive tests in two and run them in parallel. There is already machinery for doing this; we just need to flip the options on in the travis.yml to make it happen. This is only going to get more critical as we whitelist more Hive tests. We also talked about checking the PR and skipping the Hive tests when there have been no changes in catalyst/sql/hive. I'm okay with this plan, just need to find someone with time to implement it.
Re: Flaky streaming tests
There is a JIRA for one of the flaky tests here: https://issues.apache.org/jira/browse/SPARK-1409

On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote: TD - do you know what is going on here? I looked into this a bit, and at least a few of these use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do, and probably they should be re-written to virtualize time. - Patrick

On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream test fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull ) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
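As an aside, here is a sketch of what "virtualizing time" can look like in a test. This is a hand-rolled, illustrative clock rather than whatever utility the streaming tests would actually adopt:

// Illustrative only: a manual clock lets a test advance time deterministically
// instead of calling Thread.sleep() and hoping the timing works out.
trait Clock { def currentTime(): Long }

class SystemClock extends Clock {
  def currentTime(): Long = System.currentTimeMillis()
}

class ManualClock(start: Long = 0L) extends Clock {
  private var now = start
  def currentTime(): Long = synchronized { now }
  def advance(millis: Long): Unit = synchronized { now += millis }
}

// The component under test takes a Clock; production code passes a SystemClock,
// while the test injects a ManualClock and calls advance() to trigger
// time-based behavior exactly when it expects it.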
Re: RFC: varargs in Logging.scala?
Hi Marcelo, Thanks for bringing this up here, as this has been a topic of debate recently. Some thoughts below. ... all of them suffer from the fact that the log message needs to be built even though it might not be used. This is not true of the current implementation (and this is actually why Spark has a logging trait instead of just using a logger directly). If you look at the original function signatures:

protected def logDebug(msg: => String) ...

The => means that we are passing msg by name instead of by value. Under the covers, Scala is creating a closure that can be used to calculate the log message, only if it's actually required. This does result in a significant performance improvement, but it still requires allocating an object for the closure. The bytecode is really something like this:

val logMessage = new Function0() { def call() = "Log message" + someExpensiveComputation() }
log.debug(logMessage)

In Catalyst and Spark SQL we are using the scala-logging package, which uses macros to automatically rewrite all of your log statements. You write:

logger.debug(s"Log message $someExpensiveComputation")

You get:

if (logger.debugEnabled) {
  val logMsg = "Log message" + someExpensiveComputation()
  logger.debug(logMsg)
}

IMHO, this is the cleanest option (and is supported by Typesafe). Based on a micro-benchmark, it is also the fastest: std logging: 19885.48ms, spark logging: 914.408ms, scala logging: 729.779ms. Once the dust settles from the 1.0 release, I'd be in favor of standardizing on scala-logging. Michael
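For readers who want to see the by-name behavior in isolation, here is a minimal, self-contained sketch (not Spark's actual Logging trait) showing that the message expression is only evaluated when the log level is enabled:

object ByNameLoggingDemo extends App {
  var debugEnabled = false
  var evaluations = 0

  // msg: => String is a by-name parameter: the string is built lazily,
  // only if debug logging is actually enabled.
  def logDebug(msg: => String): Unit =
    if (debugEnabled) println(msg)

  def expensive(): String = { evaluations += 1; "expensive message" }

  logDebug(expensive())   // debug disabled: expensive() is never called
  println(evaluations)    // prints 0

  debugEnabled = true
  logDebug(expensive())   // now the message is built and printed
  println(evaluations)    // prints 1
}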
Re: RFC: varargs in Logging.scala?
BTW... You can do calculations in string interpolation:

s"Time: ${timeMillis / 1000}s"

Or use format strings:

f"Float with two decimal places: $floatValue%.2f"

More info: http://docs.scala-lang.org/overviews/core/string-interpolation.html

On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust mich...@databricks.com wrote: Hi Marcelo, Thanks for bringing this up here, as this has been a topic of debate recently. Some thoughts below. ... all of them suffer from the fact that the log message needs to be built even though it might not be used. This is not true of the current implementation (and this is actually why Spark has a logging trait instead of just using a logger directly). If you look at the original function signatures: protected def logDebug(msg: => String) ... The => means that we are passing msg by name instead of by value. Under the covers, Scala is creating a closure that can be used to calculate the log message, only if it's actually required. This does result in a significant performance improvement, but it still requires allocating an object for the closure. The bytecode is really something like this: val logMessage = new Function0() { def call() = "Log message" + someExpensiveComputation() } log.debug(logMessage) In Catalyst and Spark SQL we are using the scala-logging package, which uses macros to automatically rewrite all of your log statements. You write: logger.debug(s"Log message $someExpensiveComputation") You get: if (logger.debugEnabled) { val logMsg = "Log message" + someExpensiveComputation() logger.debug(logMsg) } IMHO, this is the cleanest option (and is supported by Typesafe). Based on a micro-benchmark, it is also the fastest: std logging: 19885.48ms, spark logging: 914.408ms, scala logging: 729.779ms. Once the dust settles from the 1.0 release, I'd be in favor of standardizing on scala-logging. Michael
Re: Problem creating objects through reflection
The Spark REPL is slightly modified from the normal Scala REPL to prevent work from being done twice when closures are deserialized on the workers. I'm not sure exactly why this causes your problem, but it's probably worth filing a JIRA about it. Here is another issue with classes defined in the REPL. Not sure if it is related, but I'd be curious if the workaround helps you: https://issues.apache.org/jira/browse/SPARK-1199 Michael

On Thu, Apr 24, 2014 at 3:14 AM, Piotr Kołaczkowski pkola...@datastax.com wrote: Hi, I'm working on Cassandra-Spark integration and I hit a pretty severe problem. One of the provided features is mapping Cassandra rows into objects of user-defined classes. E.g. like this:

class MyRow(val key: String, val data: Int)
sc.cassandraTable("keyspace", "table").select("key", "data").as[MyRow]  // returns CassandraRDD[MyRow]

In this example CassandraRDD creates MyRow instances by reflection, i.e. it matches selected fields from the Cassandra table and passes them to the constructor. Unfortunately this does not work in the Spark REPL. It turns out any class declared in the REPL is an inner class, and to be successfully created, it needs a reference to the outer object, even though it doesn't really use anything from the outer context.

scala> class SomeClass
defined class SomeClass
scala> classOf[SomeClass].getConstructors()(0)
res11: java.lang.reflect.Constructor[_] = public $iwC$$iwC$SomeClass($iwC$$iwC)

I tried passing null as a temporary workaround, and it also doesn't work - I get an NPE. How can I get a reference to the current outer object representing the context of the current line? Also, the plain non-Spark Scala REPL doesn't exhibit this behaviour - classes declared in the REPL are proper top-level classes, not inner ones. Why? Thanks, Piotr -- Piotr Kolaczkowski, Lead Software Engineer pkola...@datastax.com 777 Mariners Island Blvd., Suite 510 San Mateo, CA 94404
Re: [VOTE] Release Apache Spark 1.0.0 (rc8)
-1 We found a regression in the way configuration is passed to executors. https://issues.apache.org/jira/browse/SPARK-1864 https://github.com/apache/spark/pull/808 Michael On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.comwrote: +1 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com wrote: [Due to ASF e-mail outage, I'm not if anyone will actually receive this.] Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has only minor changes on top of rc7. The tag to be voted on is v1.0.0-rc8 (commit 80eea0f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1016/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Monday, May 19, at 10:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: Timestamp support in v1.0
Thanks for reporting this! https://issues.apache.org/jira/browse/SPARK-1964 https://github.com/apache/spark/pull/913 If you could test out that PR and see if it fixes your problems I'd really appreciate it! Michael

On Thu, May 29, 2014 at 9:09 AM, Andrew Ash and...@andrewash.com wrote: I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep that diff much smaller going forward. Can you try the same thing in Scala?

On Thu, May 29, 2014 at 8:54 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275 https://github.com/apache/spark/pull/275 is included in? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in pyspark.

*The error I get:*
14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM aol
14/05/29 15:44:48 INFO ParseDriver: Parse Completed
14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI thrift:
14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection attempt.
14/05/29 15:44:49 INFO metastore: Connected to metastore.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql return self.hiveql(hqlQuery)
File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
: java.lang.RuntimeException: Unsupported dataType: timestamp

*The table I loaded:*
DROP TABLE IF EXISTS aol;
CREATE EXTERNAL TABLE aol (
  userid STRING,
  query STRING,
  query_time TIMESTAMP,
  item_rank INT,
  click_url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/tmp/data/aol';

*The pyspark commands:*
from pyspark.sql import HiveContext
hctx = HiveContext(sc)
results = hctx.hql("SELECT COUNT(*) FROM aol").collect()

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
Yes, you'll need to download the code from that PR and reassemble Spark (sbt/sbt assembly). On Thu, May 29, 2014 at 10:02 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Michael, Will I have to rebuild after adding the change? Thanks -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
You should be able to get away with only doing it locally. This bug is happening during analysis which only occurs on the driver. On Thu, May 29, 2014 at 10:17 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Darn, I was hoping just to sneak it in that file. I am not the only person working on the cluster; if I rebuild it that means I have to redeploy everything to all the nodes as well. So I cannot do that ... today. If someone else doesn't beat me to it. I can rebuild at another time. - Cheers, Stephanie -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6857.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
Awesome, thanks for testing! On Thu, Jun 5, 2014 at 1:30 PM, dataginjaninja rickett.stepha...@gmail.com wrote: I can confirm that the patch fixed my issue. :-) - Cheers, Stephanie -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6948.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: question about Hive compatiblilty tests
I assume you are adding tests, because that is the only time you should see that message. That error could mean a couple of things: 1) the query is invalid and Hive threw an exception; 2) your Hive setup is bad. Regarding #2, you need to have the source for Hive 0.12.0 available and built, as well as a Hadoop installation. You also have to have the environment vars set as specified here: https://github.com/apache/spark/tree/master/sql Michael

On Thu, Jun 19, 2014 at 12:22 AM, Will Benton wi...@redhat.com wrote: Hi all, Does a "Failed to generate golden answer for query" message from HiveComparisonTests indicate that it isn't possible to run the query in question under Hive from Spark's test suite, rather than anything about Spark's implementation of HiveQL? The stack trace I'm getting implicates Hive code and not Spark code, but I wanted to make sure I wasn't missing something. thanks, wb
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
+1 I tested sql/hive functionality. On Sat, Jul 5, 2014 at 9:30 AM, Mark Hamstra m...@clearstorydata.com wrote: +1 On Fri, Jul 4, 2014 at 12:40 PM, Patrick Wendell pwend...@gmail.com wrote: I'll start the voting with a +1 - ran tests on the release candidate and ran some basic programs. RC1 passed our performance regression suite, and there are no major changes from that RC. On Fri, Jul 4, 2014 at 12:39 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1021/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, July 07, at 20:45 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ === Differences from RC1 === This release includes only one blocking patch from rc1: https://github.com/apache/spark/pull/1255 There are also smaller fixes which came in over the last week. === About this release === This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are: SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. SPARK-1790: Support r3 instance types on EC2. This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
Re: sparkSQL thread safe?
Hey Ian, Thanks for bringing these up! Responses in-line: Just wondering if right now Spark SQL is expected to be thread safe on master? Doing a simple hadoop file -> RDD -> schema RDD -> write parquet will fail in reflection code if I run these in a thread pool. You are probably hitting SPARK-2178 https://issues.apache.org/jira/browse/SPARK-2178 which is caused by SI-6240 https://issues.scala-lang.org/browse/SI-6240. We have a plan to fix this by moving the schema introspection to compile time, using macros. The SparkSqlSerializer seems to create a new Kryo instance each time it wants to serialize anything. I got a huge speedup when I had any non-primitive type in my SchemaRDD by using the ResourcePools from Chill to provide the KryoSerializer to it. (I can open an RB if there is some reason not to re-use them?) Sounds like SPARK-2102 https://issues.apache.org/jira/browse/SPARK-2102. There is no reason AFAIK not to reuse the instance. A PR would be greatly appreciated! With the Distinct Count operator there are no map-side operations, and a test to check for this. Is there any reason not to do a map-side combine into a set and then merge the sets later? (Similar to the approximate distinct count operator.) That's just not an optimization that we had implemented yet... but I've just done it here https://github.com/apache/spark/pull/1366 and it'll be in master soon :) Another thing while I'm mailing... the 1.0.1 docs have a section like: // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit, // you can use custom classes that implement the Product interface. Which sounds great; we have lots of data in Thrift, so via Scrooge ( https://github.com/twitter/scrooge ) we end up with instances of traits which implement Product. Though the reflection code appears to look for the constructor of the class and base the types on those parameters? Yeah, that's true that we only look at the constructor at the moment, but I don't think there is a really good reason for that (other than I guess we will need to add code to make sure we skip built-in object methods). If you want to open a JIRA, we can try fixing this. Michael
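For reference, here is a hedged sketch of the workaround the quoted docs describe: a plain (non-case) class that implements Product so it can carry more than 22 fields. The field names are invented, only three fields are shown to keep it short, and note Michael's caveat above that the current reflection code reads the constructor parameters:

class WideRecord(val id: Long, val name: String, val score: Double)
    extends Product with Serializable {

  // Product members enumerate the fields in constructor order.
  def productArity: Int = 3

  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
}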
Re: Catalyst dependency on Spark Core
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like Catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd create some sort of spark common package that has things like logging. That way Catalyst could depend on that, without pulling in all of Hadoop, etc. Maybe others have opinions though, so I'm cc-ing the dev list.

On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote: Making Catalyst independent of Spark is the goal of Catalyst; maybe it needs time and evolution. I noticed that the package org.apache.spark.sql.catalyst.util imports org.apache.spark.util.{Utils => SparkUtils}, so Catalyst has a dependency on Spark core. I'm not sure whether it will be replaced by another component independent of Spark in a later release. 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com: As per the recent presentation given at Scala Days ( http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf ), it was mentioned that Catalyst is independent of Spark. But on inspecting pom.xml of the sql/catalyst module, it seems it has a dependency on Spark Core. Any particular reason for the dependency? I would love to use Catalyst outside Spark. (Reposted as previous email bounced. Sorry if this is a duplicate.)
Change when loading/storing String data using Parquet
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using Parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as strings. 9fe693 https://github.com/apache/spark/commit/9fe693b5b6ed6af34ee1e800ab89c8a11991ea38 fixes this limitation by adding support for binary data. However, data written out with a prior version of Spark SQL will be missing the annotation telling us to interpret a given column as a string, so old string data will now be loaded as binary data. If you would like to use the data as a string, you will need to add a CAST to convert the datatype. New string data written out after this change will correctly be loaded as a string, since we now include an annotation about the desired type. Additionally, this should now interoperate correctly with other systems that write Parquet data (Hive, Thrift, etc.). Michael
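As a hedged illustration of the CAST mentioned above (the path, table name, and column name here are all made up, and the registration method name depends on the Spark version in use):

// Old data written before this change comes back as binary, so cast it.
val oldData = sqlContext.parquetFile("/data/old-strings.parquet")
oldData.registerTempTable("old_strings")   // registerAsTable on older builds

val asStrings = sqlContext.sql(
  "SELECT CAST(name AS STRING) AS name FROM old_strings")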
Re: SQLQuerySuite error
Thanks for reporting back. I was pretty confused trying to reproduce the error :) On Thu, Jul 24, 2014 at 1:09 PM, Stephen Boesch java...@gmail.com wrote: OK I did find my error. The missing step: mvn install I should have republished (mvn install) all of the other modules . The mvn -pl will rely on the modules locally published and so the latest code that I had git pull'ed was not being used (except the sql/core module code). The tests are passing after having properly performed the mvn install before running with the mvn -pl sql/core. 2014-07-24 12:04 GMT-07:00 Stephen Boesch java...@gmail.com: Are other developers seeing the following error for the recently added substr() method? If not, any ideas why the following invocation of tests would be failing for me - i.e. how the given invocation would need to be tweaked? mvn -Pyarn -Pcdh5 test -pl sql/core -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite (note cdh5 is a custom profile for cdh5.0.0 but should not be affecting these results) Only the test(SPARK-2407 Added Parser of SQL SUBSTR()) fails: all of the other 33 tests pass. SQLQuerySuite: - SPARK-2041 column name equals tablename - SPARK-2407 Added Parser of SQL SUBSTR() *** FAILED *** Exception thrown while executing query: == Logical Plan == java.lang.UnsupportedOperationException == Optimized Logical Plan == java.lang.UnsupportedOperationException == Physical Plan == java.lang.UnsupportedOperationException == Exception == java.lang.UnsupportedOperationException java.lang.UnsupportedOperationException at org.apache.spark.sql.catalyst.analysis.EmptyFunctionRegistry$.lookupFunction(FunctionRegistry.scala:33) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:131) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:129) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to (TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.plans.QueryPlan.org $apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:52) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at
Re: [VOTE] Release Apache Spark 1.0.2 (RC1)
That query is looking at Fix Version not Target Version. The fact that the first one is still open is only because the bug is not resolved in master. It is fixed in 1.0.2. The second one is partially fixed in 1.0.2, but is not worth blocking the release for. On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: TD, there are a couple of unresolved issues slated for 1.0.2 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC . Should they be edited somehow? On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.2. This release fixes a number of bugs in Spark 1.0.1. Some of the notable ones are - SPARK-2452: Known issue is Spark 1.0.1 caused by attempted fix for SPARK-1199. The fix was reverted for 1.0.2. - SPARK-2576: NoClassDefFoundError when executing Spark QL query on HDFS CSV file. The full list is at http://s.apache.org/9NJ The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f The release files, including signatures, digests, etc can be found at: http://people.apache.org/~tdas/spark-1.0.2-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1024/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/ Please vote on releasing this package as Apache Spark 1.0.2! The vote is open until Tuesday, July 29, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.2 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/
Re: new JDBC server test cases seems failed ?
How recent is this? We've already reverted this patch once due to failing tests. It would be helpful to include a link to the failed build. If it's failing again we'll have to revert again.

On Sun, Jul 27, 2014 at 5:26 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all It seems that the JDBC test cases are failing unexpectedly in Jenkins?

[info] - test query execution against a Hive Thrift server *** FAILED ***
[info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:45518/: java.net.ConnectException: Connection refused
[info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146)
[info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123)
[info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
[info] at java.sql.DriverManager.getConnection(DriverManager.java:571)
[info] at java.sql.DriverManager.getConnection(DriverManager.java:215)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:110)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
[info] ...
[info] Cause: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
[info] at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
[info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
[info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
[info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
[info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123)
[info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
[info] at java.sql.DriverManager.getConnection(DriverManager.java:571)
[info] at java.sql.DriverManager.getConnection(DriverManager.java:215)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
[info] ...
[info] Cause: java.net.ConnectException: Connection refused
[info] at java.net.PlainSocketImpl.socketConnect(Native Method)
[info] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
[info] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
[info] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
[info] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
[info] at java.net.Socket.connect(Socket.java:579)
[info] at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
[info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
[info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
[info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
[info] ...
[info] CliSuite: Executing: create table hive_test1(key int, val string);, expecting output: OK
[warn] four warnings found
[warn] Note: /home/jenkins/workspace/SparkPullRequestBuilder@4/core/src/test/java/org/apache/spark/JavaAPISuite.java uses or overrides a deprecated API.
[warn] Note: Recompile with -Xlint:deprecation for details.
[info] - simple commands *** FAILED ***
[info] java.lang.AssertionError: assertion failed: Didn't find OK in the output:
[info] at scala.Predef$.assert(Predef.scala:179)
[info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70)
[info] at org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:25)
[info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62)
[info] at org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:25)
[info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:53)
[info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51)
[info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See
Re: Working Formula for Hive 0.13?
A few things: - When we upgrade to Hive 0.13.0, Patrick will likely republish the hive-exec jar just as we did for 0.12.0. - Since we have to tie into some pretty low-level APIs, it is unsurprising that the code doesn't just compile out of the box against 0.13.0. - ScalaReflection is for determining a schema from Scala classes, not reflection-based bridge code. Either way, it's unclear to me if there is any reason to use reflection to support multiple versions, instead of just upgrading to Hive 0.13.0. One question I have is: what is the goal of upgrading to Hive 0.13.0? Is it purely because you are having problems connecting to newer metastores? Are there some features you are hoping for? This will help me prioritize this effort. Michael

On Mon, Jul 28, 2014 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote: I was looking for a class where reflection-related code should reside. I found this but don't think it is the proper class for bridging differences between Hive 0.12 and 0.13.1: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala Cheers

On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote: After manually copying Hive 0.13.1 jars to the local maven repo, I got the following errors when building the spark-hive_2.10 module:

[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182: type mismatch; found : String required: Array[String]
[ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR] ^
[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60: value getAllPartitionsForPruner is not a member of org.apache.hadoop.hive.ql.metadata.Hive
[ERROR] client.getAllPartitionsForPruner(table).toSeq
[ERROR] ^
[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties)
[ERROR] val tableDesc = new TableDesc(
[ERROR] ^
[WARNING] Class org.antlr.runtime.tree.CommonTree not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.Token not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with a stub.
[ERROR] while compiling: /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala during phase: typer library version: version 2.10.4 compiler version: version 2.10.4

The above shows incompatible changes between 0.12 and 0.13.1, e.g. the first error corresponds to the following method in CommandProcessorFactory: public static CommandProcessor get(String[] cmd, HiveConf conf) Cheers

On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com wrote: So, do we have a short-term fix until Hive 0.14 comes out? Perhaps adding the hive-exec jar to the spark-project repo? It doesn't look like there's a release date schedule for 0.14. On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote: Exactly, forgot to mention the Hulu team also made changes to cope with those incompatibility issues, but they said that's relatively easy once the re-packaging work is done.
On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com wrote: I've heard from Cloudera that there were hive internal changes between 0.12 and 0.13 that required code re-writing. Over time it might be possible for us to integrate with hive using API's that are more stable (this is the domain of Michael/Cheng/Yin more than me!). It would be interesting to see what the Hulu folks did. - Patrick On Mon, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com wrote: AFAIK, according a recent talk, Hulu team in China has built Spark SQL against Hive 0.13 (or 0.13.1?) successfully. Basically they also re-packaged Hive 0.13 as what the Spark team did. The slides of the talk hasn't been released yet though. On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu yuzhih...@gmail.com wrote: Owen helped me find this: https://issues.apache.org/jira/browse/HIVE-7423 I guess this means that for Hive 0.14, Spark should be able to directly pull in hive-exec-core.jar Cheers On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend...@gmail.com wrote: It would be great if the hive team can fix
Re: How to run specific sparkSQL test with maven
It seems that the HiveCompatibilitySuite needs a Hadoop and Hive environment, am I right? Relative path in absolute URI: file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1 You should only need Hadoop and Hive if you are creating new tests that we need to compute the answers for. Existing tests are run with cached answers. There are details about the configuration here: https://github.com/apache/spark/tree/master/sql
Re: Working Formula for Hive 0.13?
Could you make a PR as described here: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Fri, Aug 8, 2014 at 1:57 PM, Zhan Zhang zhaz...@gmail.com wrote: Sorry, forget to upload files. I have never posted before :) hive.diff http://apache-spark-developers-list.1001551.n3.nabble.com/file/n/hive.diff -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Working-Formula-for-Hive-0-13-tp7551p.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Working Formula for Hive 0.13?
Thanks for working on this! It's unclear at the moment exactly how we are going to handle this, since the end goal is to be compatible with as many versions of Hive as possible. That said, I think it would be great to open a PR in this case. Even if we don't merge it, that's a good way to get it on people's radar and have a discussion about the changes that are required.

On Sun, Aug 24, 2014 at 7:11 PM, scwf wangf...@huawei.com wrote: I have been working on a branch that updates the Hive version to hive-0.13 (by org.apache.hive): https://github.com/scwf/spark/tree/hive-0.13 I am wondering whether it's ok to make a PR now, because the hive-0.13 version is not compatible with hive-0.12 and here I used org.apache.hive.

On 2014/7/29 8:22, Michael Armbrust wrote: A few things: - When we upgrade to Hive 0.13.0, Patrick will likely republish the hive-exec jar just as we did for 0.12.0. - Since we have to tie into some pretty low-level APIs, it is unsurprising that the code doesn't just compile out of the box against 0.13.0. - ScalaReflection is for determining a schema from Scala classes, not reflection-based bridge code. Either way, it's unclear to me if there is any reason to use reflection to support multiple versions, instead of just upgrading to Hive 0.13.0. One question I have is: what is the goal of upgrading to Hive 0.13.0? Is it purely because you are having problems connecting to newer metastores? Are there some features you are hoping for? This will help me prioritize this effort. Michael

On Mon, Jul 28, 2014 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote: I was looking for a class where reflection-related code should reside. I found this but don't think it is the proper class for bridging differences between Hive 0.12 and 0.13.1: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala Cheers

On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote: After manually copying Hive 0.13.1 jars to the local maven repo, I got the following errors when building the spark-hive_2.10 module:

[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182: type mismatch; found : String required: Array[String]
[ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR] ^
[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60: value getAllPartitionsForPruner is not a member of org.apache.hadoop.hive.ql.metadata.Hive
[ERROR] client.getAllPartitionsForPruner(table).toSeq
[ERROR] ^
[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties)
[ERROR] val tableDesc = new TableDesc(
[ERROR] ^
[WARNING] Class org.antlr.runtime.tree.CommonTree not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.Token not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with a stub.
[ERROR] while compiling: /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala during phase: typer library version: version 2.10.4 compiler version: version 2.10.4

The above shows incompatible changes between 0.12 and 0.13.1, e.g. the first error corresponds to the following method in CommandProcessorFactory: public static CommandProcessor get(String[] cmd, HiveConf conf) Cheers

On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com wrote: So, do we have a short-term fix until Hive 0.14 comes out? Perhaps adding the hive-exec jar to the spark-project repo? It doesn't look like there's a release date schedule for 0.14. On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote: Exactly, forgot to mention the Hulu team also made changes to cope with those incompatibility issues, but they said that's relatively easy once the re-packaging work is done. On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com wrote: I've heard from Cloudera that there were Hive internal changes between 0.12 and 0.13 that required code rewriting. Over time it might be possible for us to integrate with Hive using APIs that are more stable (this is the domain of Michael/Cheng/Yin more than me!). It would be interesting to see what the Hulu folks did. - Patrick On Mon, Jul 28, 2014
Re: Storage Handlers in Spark SQL
- dev list + user list You should be able to query Spark SQL using JDBC, starting with the 1.1 release. There is some documentation in the repo https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server, and we'll update the official docs once the release is out.

On Thu, Aug 21, 2014 at 4:43 AM, Niranda Perera nira...@wso2.com wrote: Hi, I have been playing around with Spark for the past few days, and evaluating the possibility of migrating into Spark (Spark SQL) from Hive/Hadoop. I am working on the WSO2 Business Activity Monitor (WSO2 BAM, https://docs.wso2.com/display/BAM241/WSO2+Business+Activity+Monitor+Documentation ) which currently employs Hive. We are considering Spark as a successor for Hive, given its performance enhancement. We currently employ several custom storage handlers in Hive. Example: WSO2 JDBC and Cassandra storage handlers: https://docs.wso2.com/display/BAM241/JDBC+Storage+Handler+for+Hive https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-cas I would like to know whether Spark SQL can work with these storage handlers (while using HiveContext, maybe)? Best regards -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 https://twitter.com/N1R44
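For completeness, here is a hedged sketch of what querying the Thrift JDBC server looks like from a client, using a stock Hive JDBC driver as the test suite quoted elsewhere in this digest does. The host, port, and table name are placeholders:

import java.sql.DriverManager

// Register the Hive JDBC driver and connect to the Thrift server.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val connection =
  DriverManager.getConnection("jdbc:hive2://localhost:10000/default")

val statement = connection.createStatement()
val resultSet = statement.executeQuery("SELECT COUNT(*) FROM some_table")
while (resultSet.next()) {
  println(resultSet.getLong(1))
}
connection.close()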
Re: [Spark SQL] off-heap columnar store
Any initial proposal or design about the caching to Tachyon that you can share so far? Caching parquet files in tachyon with saveAsParquetFile and then reading them with parquetFile should already work. You can use SQL on these tables by using registerTempTable. Some of the general parquet work that we have been doing includes: #1935 https://github.com/apache/spark/pull/1935, SPARK-2721 https://issues.apache.org/jira/browse/SPARK-2721, SPARK-3036 https://issues.apache.org/jira/browse/SPARK-3036, SPARK-3037 https://issues.apache.org/jira/browse/SPARK-3037 and #1819 https://github.com/apache/spark/pull/1819 The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical. Can you elaborate?
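To make the suggested workflow concrete, here is a sketch; the tachyon:// path and table names are placeholders:

// Write a result out as Parquet into Tachyon...
val events = sqlContext.sql("SELECT * FROM raw_events")
events.saveAsParquetFile("tachyon://tachyon-master:19998/tables/events")

// ...then read it back, register it, and query it with SQL.
val cached = sqlContext.parquetFile("tachyon://tachyon-master:19998/tables/events")
cached.registerTempTable("events_cached")
sqlContext.sql("SELECT COUNT(*) FROM events_cached").collect()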
Re: CoHadoop Papers
It seems like there are two things here:
- Co-locating blocks with the same keys to avoid network transfer.
- Leveraging partitioning information to avoid a shuffle when data is already partitioned correctly (even if those partitions aren't yet on the same machine).
The former seems more complicated and probably requires the support from Hadoop you linked to. However, the latter might be easier, as there is already a framework for reasoning about partitioning and the need to shuffle in the Spark SQL planner (see the sketch after this message for the flavor of it at the RDD level).

On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com wrote: Christopher, can you expand on the co-partitioning support? We have a number of Spark SQL tables (saved in Parquet format) that all could be considered to have a common hash key. Our analytics team wants to do frequent joins across these different data sets based on this key. It makes sense that if the data for each key across 'tables' was co-located on the same server, shuffles could be minimized and ultimately performance could be much better. From reading the HDFS issue I posted before, the way is being paved for implementing this type of behavior, though there are a lot of complications to make it work, I believe.

On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com wrote: Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support copartitioning. If the latter, it's an HDFS scope that's outside of Spark. On that note, Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases. Sent while mobile. Pls excuse typos etc.

On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote: It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576

On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS makes no attempts to try to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
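Below is a small sketch of the second point at the plain-RDD level (example keys and values are made up): when both sides of a join have already been partitioned with the same partitioner, Spark plans the join without re-shuffling either input, which is the kind of reasoning the SQL planner applies to its own operators.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD functions

val partitioner = new HashPartitioner(16)

// Pre-partition both datasets with the same partitioner and cache them.
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(partitioner)
  .cache()
val visits = sc.parallelize(Seq((1, "/home"), (2, "/search")))
  .partitionBy(partitioner)
  .cache()

// Because the partitioners match, this join reuses the existing partitioning
// instead of shuffling both inputs again.
val joined = users.join(visits)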
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1 On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac OS X. Matei On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote: +1 Verified PySpark InputFormat/OutputFormat examples. On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin r...@databricks.com wrote: +1 On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com wrote: +1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined. On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote: +1 Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK 8). best, wb - Original Message - From: Patrick Wendell pwend...@gmail.com To: dev@spark.apache.org Sent: Saturday, August 30, 2014 5:07:52 PM Subject: [VOTE] Release Apache Spark 1.1.0 (RC3) Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc3 (commit b2d0493b): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1030/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Tuesday, September 02, at 23:07 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC1 == - Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234 - EC2 script version bump to 1.1.0. == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.1.0 (RC4)
+1 On Wed, Sep 3, 2014 at 12:29 AM, Reynold Xin r...@databricks.com wrote: +1 Tested locally on Mac OS X with local-cluster mode. On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com wrote: I'll kick it off with a +1 On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1031/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Saturday, September 06, at 08:30 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC3 == SPARK-3332 - Issue with tagging in EC2 scripts SPARK-3358 - Issue with regression for m3.XX instances == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false. 3. PySpark uses a new heuristic for determining the parallelism of shuffle operations. -- Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: trimming unnecessary test output
Feel free to submit a PR to add a log4j.properties file to sql/catalyst/src/test/resources similar to what we do in core/hive. On Sat, Sep 6, 2014 at 2:50 PM, Sean Owen so...@cloudera.com wrote: This is just a line logging that one test succeeded, right? I don't find that noise. Recently I wanted to search test run logs for a test case success and it was important that the individual test case was logged. On Sep 6, 2014 4:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Continuing the discussion started here https://github.com/apache/spark/pull/2279, I’m wondering if people already know that certain test output is unnecessary and should be trimmed. For example https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19917/consoleFull , I see a bunch of lines like this: 14/09/06 07:54:13 INFO GenerateProjection: Code generated expression List(IS NOT NULL 1) in 128.33733 ms Can/should this type of output be suppressed? Is there any other test output that is obviously more noise than signal? Nick
Re: parquet predicate / projection pushdown into unionAll
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA or even a PR :)
Re: parquet predicate / projection pushdown into unionAll
Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote: Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA or even a PR :)
Re: parquet predicate / projection pushdown into unionAll
I think usually people add these directories as multiple partitions of the same table instead of union. This actually allows us to efficiently prune directories when reading in addition to standard column pruning. On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf malouf.g...@gmail.com wrote: I'm kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com wrote: Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote: Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA or even a PR :)
Re: parquet predicate / projection pushdown into unionAll
What Patrick said is correct. Two other points: - In the 1.2 release we are hoping to beef up the support for working with partitioned parquet independent of the metastore. - You can actually do operations like INSERT INTO for parquet tables to add data. This creates new parquet files for each insertion. This will break if there are multiple concurrent writers to the same table. On Tue, Sep 9, 2014 at 12:09 PM, Patrick Wendell pwend...@gmail.com wrote: I think what Michael means is people often use this to read existing partitioned Parquet tables that are defined in a Hive metastore rather than data generated directly from within Spark and then reading it back as a table. I'd expect the latter case to become more common, but for now most users connect to an existing metastore. I think you could go this route by creating a partitioned external table based on the on-disk layout you create. The downside is that you'd have to go through a hive metastore whereas what you are doing now doesn't need hive at all. We should also just fix the case you are mentioning where a union is used directly from within spark. But that's the context. - Patrick On Tue, Sep 9, 2014 at 12:01 PM, Cody Koeninger c...@koeninger.org wrote: Maybe I'm missing something, I thought parquet was generally a write-once format and the sqlContext interface to it seems that way as well. d1.saveAsParquetFile(/foo/d1) // another day, another table, with same schema d2.saveAsParquetFile(/foo/d2) Will give a directory structure like /foo/d1/_metadata /foo/d1/part-r-1.parquet /foo/d1/part-r-2.parquet /foo/d1/_SUCCESS /foo/d2/_metadata /foo/d2/part-r-1.parquet /foo/d2/part-r-2.parquet /foo/d2/_SUCCESS // ParquetFileReader will fail, because /foo/d1 is a directory, not a parquet partition sqlContext.parquetFile(/foo) // works, but has the noted lack of pushdown sqlContext.parquetFile(/foo/d1).unionAll(sqlContext.parquetFile(/foo/d2)) Is there another alternative? On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com wrote: I think usually people add these directories as multiple partitions of the same table instead of union. This actually allows us to efficiently prune directories when reading in addition to standard column pruning. On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf malouf.g...@gmail.com wrote: I'm kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com wrote: Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote: Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA or even a PR :)
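A rough sketch of the metastore-based alternative for the /foo/d1, /foo/d2 layout discussed above, assuming a HiveContext's sql is in scope (column names are hypothetical; the serde and input/output format classes are the Hive 12-era parquet bundle ones):

// Register each directory as a partition of one external parquet table; partition pruning
// then limits reads to the matching directories instead of scanning a hand-built union.
sql("""CREATE EXTERNAL TABLE foo (a INT, b STRING)
      |PARTITIONED BY (day STRING)
      |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
      |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
      |LOCATION '/foo'""".stripMargin)
sql("ALTER TABLE foo ADD PARTITION (day='d1') LOCATION '/foo/d1'")
sql("ALTER TABLE foo ADD PARTITION (day='d2') LOCATION '/foo/d2'")
sql("SELECT a, b FROM foo WHERE day = 'd2'")   // only /foo/d2 needs to be read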
Re: parquet predicate / projection pushdown into unionAll
Hey Cody, Thanks for doing this! Will look at your PR later today. Michael On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger c...@koeninger.org wrote: Tested the patch against a cluster with some real data. Initial results seem like going from one table to a union of 2 tables is now closer to a doubling of query time as expected, instead of 5 to 10x. Let me know if you see any issues with that PR. On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninger c...@koeninger.org wrote: So the obvious thing I was missing is that the analyzer has already resolved attributes by the time the optimizer runs, so the references in the filter / projection need to be fixed up to match the children. Created a PR, let me know if there's a better way to do it. I'll see about testing performance against some actual data sets. On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org wrote: Ok, so looking at the optimizer code for the first time and trying the simplest rule that could possibly work, object UnionPushdown extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { // Push down filter into union case f @ Filter(condition, u @ Union(left, right)) => u.copy(left = f.copy(child = left), right = f.copy(child = right)) // Push down projection into union case p @ Project(projectList, u @ Union(left, right)) => u.copy(left = p.copy(child = left), right = p.copy(child = right)) } } If I try manually applying that rule to a logical plan in the repl, it produces the query shape I'd expect, and executing that plan results in parquet pushdowns as I'd expect. But adding those cases to ColumnPruning results in a runtime exception (below). I can keep digging, but it seems like I'm missing some obvious initial context around naming of attributes. If you can provide any pointers to speed me on my way I'd appreciate it.
java.lang.AssertionError: assertion failed: ArrayBuffer() + ArrayBuffer() != WrappedArray(name#6, age#7), List(name#9, age#10, phones#11) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.parquet.ParquetTableScan.init(ParquetTableOperations.scala:75) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:367) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:230) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:282) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406) at org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:431) On Tue, Sep 9, 2014 at 3:02 PM, Michael Armbrust mich...@databricks.com wrote: What Patrick said is correct. Two other points: - In the 1.2 release we are hoping to beef up the support
Re: parquet predicate / projection pushdown into unionAll
Yeah, thanks for implementing it! Since Spark SQL is an alpha component and moving quickly the plan is to backport all of master into the next point release in the 1.1 series. On Fri, Sep 12, 2014 at 9:27 AM, Cody Koeninger c...@koeninger.org wrote: Cool, thanks for your help on this. Any chance of adding it to the 1.1.1 point release, assuming there ends up being one? On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust mich...@databricks.com wrote: Hey Cody, Thanks for doing this! Will look at your PR later today. Michael On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger c...@koeninger.org wrote: Tested the patch against a cluster with some real data. Initial results seem like going from one table to a union of 2 tables is now closer to a doubling of query time as expected, instead of 5 to 10x. Let me know if you see any issues with that PR. On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninger c...@koeninger.org wrote: So the obvious thing I was missing is that the analyzer has already resolved attributes by the time the optimizer runs, so the references in the filter / projection need to be fixed up to match the children. Created a PR, let me know if there's a better way to do it. I'll see about testing performance against some actual data sets. On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org wrote: Ok, so looking at the optimizer code for the first time and trying the simplest rule that could possibly work, object UnionPushdown extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { // Push down filter into union case f @ Filter(condition, u @ Union(left, right)) = u.copy(left = f.copy(child = left), right = f.copy(child = right)) // Push down projection into union case p @ Project(projectList, u @ Union(left, right)) = u.copy(left = p.copy(child = left), right = p.copy(child = right)) } } If I try manually applying that rule to a logical plan in the repl, it produces the query shape I'd expect, and executing that plan results in parquet pushdowns as I'd expect. But adding those cases to ColumnPruning results in a runtime exception (below) I can keep digging, but it seems like I'm missing some obvious initial context around naming of attributes. If you can provide any pointers to speed me on my way I'd appreciate it. 
java.lang.AssertionError: assertion failed: ArrayBuffer() + ArrayBuffer() != WrappedArray(name#6, age#7), List(name#9, age#10, phones#11) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.parquet.ParquetTableScan.init(ParquetTableOperations.scala:75) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:367) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:230) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:282) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400
Re: problem with HiveContext inside Actor
- dev Is it possible that you are constructing more than one HiveContext in a single JVM? Due to global state in Hive code this is not allowed. Michael On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, Du I am not sure what you mean “triggers the HiveContext to create a database”, do you create the sub class of HiveContext? Just be sure you call the “HiveContext.sessionState” eagerly, since it will set the proper “hiveconf” into the SessionState, otherwise the HiveDriver will always get the null value when retrieving HiveConf. Cheng Hao *From:* Du Li [mailto:l...@yahoo-inc.com.INVALID] *Sent:* Thursday, September 18, 2014 7:51 AM *To:* u...@spark.apache.org; dev@spark.apache.org *Subject:* problem with HiveContext inside Actor Hi, Wonder anybody had similar experience or any suggestion here. I have an akka Actor that processes database requests in high-level messages. Inside this Actor, it creates a HiveContext object that does the actual db work. The main thread creates the needed SparkContext and passes in to the Actor to create the HiveContext. When a message is sent to the Actor, it is processed properly except that, when the message triggers the HiveContext to create a database, it throws a NullPointerException in hive.ql.Driver.java which suggests that its conf variable is not initialized. Ironically, it works fine if my main thread directly calls actor.hiveContext to create the database. The spark version is 1.1.0. Thanks, Du
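A sketch of the single-HiveContext pattern being suggested here (the actor, database name and app name are made up; the point is only that one HiveContext is built on the driver and shared, rather than one per actor):

import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// The actor receives SQL strings and runs them against the one shared HiveContext.
class QueryActor(hive: HiveContext) extends Actor {
  def receive = {
    case query: String => sender ! hive.sql(query).collect()
  }
}

val sc = new SparkContext(new SparkConf().setAppName("hive-actor-example"))
val hive = new HiveContext(sc)   // the only HiveContext constructed in this JVM
val system = ActorSystem("example")
val worker = system.actorOf(Props(new QueryActor(hive)))
worker ! "CREATE DATABASE IF NOT EXISTS testdb"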
Re: Support for Hive buckets
Hi Cody, There are currently no concrete plans for adding buckets to Spark SQL, but that's mostly due to a lack of resources / demand for this feature. Adding full support is probably a fair amount of work since you'd have to make changes throughout parsing/optimization/execution. That said, there are probably some smaller tasks that could be easier (for example, you might be able to avoid a shuffle when doing joins on tables that are already bucketed by exposing more metastore information to the planner). Michael On Sun, Sep 14, 2014 at 3:10 PM, Cody Koeninger c...@koeninger.org wrote: I noticed that the release notes for 1.1.0 said that spark doesn't support Hive buckets yet. I didn't notice any jira issues related to adding support. Broadly speaking, what would be involved in supporting buckets, especially the bucketmapjoin and sortedmerge optimizations?
Re: OutOfMemoryError on parquet SnappyDecompressor
I actually submitted a patch to do this yesterday: https://github.com/apache/spark/pull/2493 Can you tell us more about your configuration. In particular how much memory/cores do the executors have and what does the schema of your data look like? On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger c...@koeninger.org wrote: So as a related question, is there any reason the settings in SQLConf aren't read from the spark context's conf? I understand why the sql conf is mutable, but it's not particularly user friendly to have most spark configuration set via e.g. defaults.conf or --properties-file, but for spark sql to ignore those. On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger c...@koeninger.org wrote: After commit 8856c3d8 switched from gzip to snappy as default parquet compression codec, I'm seeing the following when trying to read parquet files saved using the new default (same schema and roughly same size as files that were previously working): java.lang.OutOfMemoryError: Direct buffer memory java.nio.Bits.reserveMemory(Bits.java:658) java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123) java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99) parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readFully(DataInputStream.java:169) parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201) parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521) parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493) parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546) parquet.column.impl.ColumnReaderImpl.init(ColumnReaderImpl.java:339) parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63) parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58) parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:265) parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60) parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74) parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110) parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172) parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130) org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$class.isEmpty(Iterator.scala:256) scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157) org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220) org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:722)
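If the direct buffers allocated by the snappy path turn out to be the problem, one stopgap (an assumption on my part, not something confirmed in this thread) is to keep writing new files with the old gzip codec while the issue is investigated:

// Revert the parquet writer to gzip for newly written files; existing snappy files are unaffected.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
data.saveAsParquetFile("/tmp/table_gzip")   // 'data' and the output path are hypothetical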
Re: view not supported in spark thrift server?
Views are not supported yet. It's not currently on the near-term roadmap, but that can change if there is sufficient demand or someone in the community is interested in implementing them. I do not think it would be very hard. Michael On Sun, Sep 28, 2014 at 11:59 AM, Du Li l...@yahoo-inc.com.invalid wrote: Can anybody confirm whether or not view is currently supported in spark? I found “create view translate” in the blacklist of HiveCompatibilitySuite.scala and also the following scenario threw NullPointerException on beeline/thriftserver (1.1.0). Any plan to support it soon? create table src(k string, v string); load data local inpath '/home/y/share/yspark/examples/src/main/resources/kv1.txt' into table src; create view kv as select k, v from src; select * from kv; Error: java.lang.NullPointerException (state=,code=0)
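In the meantime, registering a query as a temporary table from a HiveContext gives a rough stand-in for the view in the example above (it only lives for the current session):

// Poor-man's view: give the query a name that later SQL statements can refer to.
val kv = sql("SELECT k, v FROM src")
kv.registerTempTable("kv")
sql("SELECT * FROM kv").collect()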
Re: Extending Scala style checks
The hard part here is updating the existing code base... which is going to create merge conflicts with like all of the open PRs... On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, since there appears to be a built-in rule for end-of-line whitespace, Michael and Cheng, y'all should be able to add this in pretty easily. Nick On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nick, We can always take built-in rules. Back when we added this, Prashant Sharma actually did some great work that lets us write our own style rules in cases where rules don't exist. You can see some existing rules here: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Prashant has over time contributed a lot of our custom rules upstream to scalastyle, so now there are only a couple there. - Patrick On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at WhitespaceEndOfLineChecker under: http://www.scalastyle.org/rules-0.1.0.html Cheers On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As discussed here https://github.com/apache/spark/pull/2619, it would be good to extend our Scala style checks to programmatically enforce as many of our style rules as possible. Does anyone know if it's relatively straightforward to enforce additional rules like the no trailing spaces rule mentioned in the linked PR? Nick
Re: Parquet schema migrations
Hi Cody, Assuming you are talking about 'safe' changes to the schema (i.e. existing column names are never reused with incompatible types), this is something I'd love to support. Perhaps you can describe more what sorts of changes you are making, and if simple merging of the schemas would be sufficient. If so, we can open a JIRA, though I'm not sure when we'll have resources to dedicate to this. In the near term, I'd suggest writing converters for each version of the schema, that translate to some desired master schema. You can then union all of these together and avoid the cost of batch conversion. It seems like in most cases this should be pretty efficient, at least now that we have good pushdown past union operators :) Michael On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash and...@andrewash.com wrote: Hi Cody, I wasn't aware there were different versions of the parquet format. What's the difference between raw parquet and the Hive-written parquet files? As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example does both -- when upgrading between Cassandra versions that change the on-disk sstable format, it will do a convert-on-read as you access the sstables, or you can run the upgradesstables command to convert them all at once post-upgrade. Andrew On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger c...@koeninger.org wrote: Wondering if anyone has thoughts on a path forward for parquet schema migrations, especially for people (like us) that are using raw parquet files rather than Hive. So far we've gotten away with reading old files, converting, and writing to new directories, but that obviously becomes problematic above a certain data size.
Re: How to do broadcast join in SparkSQL
Thanks for the input. We purposefully made sure that the config option did not make it into a release as it is not something that we are willing to support long term. That said we'll try and make this easier in the future either through hints or better support for statistics. In this particular case you can get what you want by registering the tables as external tables and setting a flag. Here's a helper function to do what you need. /** * Sugar for creating a Hive external table from a parquet path. */ def createParquetTable(name: String, file: String): Unit = { import org.apache.spark.sql.hive.HiveMetastoreTypes val rdd = parquetFile(file) val schema = rdd.schema.fields.map(f => s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n") val ddl = s"""|CREATE EXTERNAL TABLE $name ( | $schema |) |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' |LOCATION '$file'""".stripMargin sql(ddl) setConf("spark.sql.hive.convertMetastoreParquet", "true") } You'll also need to run this to populate the statistics: ANALYZE TABLE tableName COMPUTE STATISTICS noscan; On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, currently there's cost-based optimization however Parquet statistics is not implemented... What's the good way if I want to join a big fact table with several tiny dimension tables in Spark SQL (1.1)? I wish we can allow user hint for the join. Jianshi On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged into master? I cannot find spark.sql.hints.broadcastTables in latest master, but it's in the following patch. https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5 Jianshi On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Yes, looks like it can only be controlled by the parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird to me. How am I suppose to know the exact bytes of a table? Let me specify the join algorithm is preferred I think. Jianshi On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu yuzhih...@gmail.com wrote: Have you looked at SPARK-1800 ? e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala Cheers On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
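Hypothetical usage of the helper above for a small dimension table; whether the join is actually broadcast still depends on spark.sql.autoBroadcastJoinThreshold and the computed statistics (table names and paths are made up):

createParquetTable("dim_country", "/warehouse/dim_country")
sql("ANALYZE TABLE dim_country COMPUTE STATISTICS noscan")
sql("SELECT f.*, d.country_name FROM fact_events f JOIN dim_country d ON f.country_id = d.id")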
Re: will/when Spark/SparkSQL will support ORCFile format
Yes, the foreign sources work is only about exposing a stable set of APIs for external libraries to link against (to avoid the spark assembly becoming a dependency mess). The code path these APIs use will be the same as that for datasources included in the core spark sql library. Michael On Thu, Oct 9, 2014 at 2:18 PM, James Yu jym2...@gmail.com wrote: For performance, will foreign data format support, same as native ones? Thanks, James On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com wrote: The foreign data source API PR also matters here https://www.github.com/apache/spark/pull/2475 Foreign data source like ORC can be added more easily and systematically after this PR is merged. On 10/9/14 8:22 AM, James Yu wrote: Thanks Mark! I will keep eye on it. @Evan, I saw people use both format, so I really want to have Spark support ORCFile. On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra m...@clearstorydata.com wrote: https://github.com/apache/spark/pull/2576 On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan velvia.git...@gmail.com wrote: James, Michael at the meetup last night said there was some development activity around ORCFiles. I'm curious though, what are the pros and cons of ORCFiles vs Parquet? On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote: Didn't see anyone asked the question before, but I was wondering if anyone knows if Spark/SparkSQL will support ORCFile format soon? ORCFile is getting more and more popular hi Hive world. Thanks, James - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Trouble running tests
Also, in general for SQL-only changes it is sufficient to run sbt/sbt catalyst/test sql/test hive/test. The hive/test part takes the longest, so I usually leave that out until just before submitting unless my changes are hive specific. On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: _RUN_SQL_TESTS needs to be true as well. Those two _... variables get set correctly when tests are run on Jenkins. They’re not meant to be manipulated directly by testers. Did you want to run SQL tests only locally? You can try faking being Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That should be simpler than futzing with the _... variables. Nick On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote: Hi, apologies if I missed a FAQ somewhere. I am trying to submit a bug fix for the very first time. Reading instructions, I forked the git repo (at c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests. I run this: ./dev/run-tests _SQL_TESTS_ONLY=true and after a while get the following error: [info] ScalaTest [info] Run completed in 3 minutes, 37 seconds. [info] Total number of tests run: 224 [info] Suites: completed 19, aborted 0 [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0 [info] All tests passed. [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5 [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM [error] Expected ID character [error] Not a valid command: hive-thriftserver [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: hive-thriftserver [error] hive-thriftserver/test [error] ^ (I am running this without my changes) I have 2 questions: 1. How to fix this 2. Is there a best practice on what to fork so you start off with a good state? I'm wondering if I should sync the latest changes or go back to a label? thanks in advance -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Parquet Migrations
You can't change parquet schema without reencoding the data as you need to recalculate the footer index data. You can manually do what SPARK-3851 https://issues.apache.org/jira/browse/SPARK-3851 is going to do today however. Consider two schemas: Old Schema: (a: Int, b: String) New Schema, where I've dropped and added a column: (a: Int, c: Long) parquetFile("old").registerTempTable("old") parquetFile("new").registerTempTable("new") sql("SELECT a, b, CAST(null AS LONG) AS c FROM old UNION ALL SELECT a, CAST(null AS STRING) AS b, c FROM new").registerTempTable("unifiedData") Because of filter/column pushdown past UNIONs this should execute as desired even if you write more complicated queries on top of unifiedData. It's a little onerous but should work for now. This can also support things like column renaming which would be much harder to do automatically. On Fri, Oct 31, 2014 at 1:49 PM, Gary Malouf malouf.g...@gmail.com wrote: Outside of what is discussed here https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is there any path for being able to modify a Parquet schema once some data has been written? This seems like the kind of thing that should make people pause when considering whether or not to use Parquet+Spark...
Re: Surprising Spark SQL benchmark
dev to bcc. Thanks for reaching out, Ozgun. Let's discuss if there were any missing optimizations off list. We'll make sure to report back or add any findings to the tuning guide. On Mon, Nov 3, 2014 at 3:01 PM, ozgun oz...@citusdata.com wrote: Hey Patrick, It's Ozgun from Citus Data. We'd like to make these benchmark results fair, and have tried different config settings for SparkSQL over the past month. We picked the best config settings we could find, and also contacted the Spark users list about running TPC-H numbers. http://goo.gl/IU5Hw0 http://goo.gl/WQ1kML http://goo.gl/ihLzgh We also received advice at the Spark Summit '14 to wait until v1.1, and therefore re-ran our tests on SparkSQL 1.1. On the specific optimizations, Marco and Samay from our team have much more context, and I'll let them answer your questions on the different settings we tried. Our intent is to be fair and not misrepresent SparkSQL's performance. On that front, we used publicly available documentation and user lists, and spent about a month trying to get the best Spark performance results. If there are specific optimizations we should have applied and missed, we'd love to be involved with the community in re-running the numbers. Is this email thread the best place to continue the conversation? Best, Ozgun -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Surprising-Spark-SQL-benchmark-tp9041p9073.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Designating maintainers for some Spark components
+1 (binding) On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who’s responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. We'd like to start with in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer. 
If you'd like to see examples for this model, check out the following projects: - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide - Subversion: https://subversion.apache.org/docs/community-guide/roles.html https://subversion.apache.org/docs/community-guide/roles.html Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. I don't think Matei should maintain *that* component) should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules. - Spark core public API: Matei, Patrick, Reynold - Job scheduler: Matei, Kay, Patrick - Shuffle and network: Reynold, Aaron, Matei - Block manager: Reynold, Aaron - YARN: Tom, Andrew Or - Python: Josh, Matei - MLlib: Xiangrui, Matei - SQL: Michael, Reynold - Streaming: TD, Matei - GraphX: Ankur, Joey, Reynold I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST. Matei
Re: Replacing Spark's native scheduler with Sparrow
However, I haven't seen it be as high as the 100ms Michael quoted (maybe this was for jobs with tasks that have much larger objects that take a long time to deserialize?). I was thinking more about the average end-to-end latency for launching a query that has 100s of partitions. It's also quite possible that SQL's task launch overhead is higher since we have never profiled how much is getting pulled into the closures.
Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
Hey Sean, Thanks for pointing this out. Looks like a bad test where we should be doing Set comparison instead of Array. Michael On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen so...@cloudera.com wrote: LICENSE and NOTICE are fine. Signature and checksum is fine. I unzipped and built the plain source distribution, which built. However I am seeing a consistent test failure with mvn -DskipTests clean package; mvn test. In the Hive module: - SET commands semantics for a HiveContext *** FAILED *** Expected Array(spark.sql.key.usedfortestonly=test.val.0, spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0), but got Array(spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0, spark.sql.key.usedfortestonly=test.val.0) (HiveQuerySuite.scala:544) Anyone else seeing this? On Thu, Nov 13, 2014 at 8:18 AM, Krishna Sankar ksanka...@gmail.com wrote: +1 1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package 10:49 min 2. Tested pyspark, mlib 2.1. statistics OK 2.2. Linear/Ridge/Laso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK 2.5. rdd operations OK 2.6. recommendation OK 2.7. Good work ! In 1.1.0, there was an error and my program used to hang (over memory allocation) consistently running validation using itertools, compute optimum rank, lambda,numofiterations/rmse; data - movielens medium dataset (1 million records) . It works well in 1.1.1 ! Cheers k/ P.S: Missed Reply all, first time On Wed, Nov 12, 2014 at 8:35 PM, Andrew Or and...@databricks.com wrote: I will start the vote with a +1 2014-11-12 20:34 GMT-08:00 Andrew Or and...@databricks.com: Please vote on releasing the following candidate as Apache Spark version 1 .1.1. This release fixes a number of bugs in Spark 1.1.0. Some of the notable ones are - [SPARK-3426] Sort-based shuffle compression settings are incompatible - [SPARK-3948] Stream corruption issues in sort-based shuffle - [SPARK-4107] Incorrect handling of Channel.read() led to data truncation The full list is at http://s.apache.org/z9h and in the CHANGES.txt attached. The tag to be voted on is v1.1.1-rc1 (commit 72a4fdbe): http://s.apache.org/cZC The release files, including signatures, digests, etc can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/andrewor14.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1034/ The documentation corresponding to this release can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc1-docs/ Please vote on releasing this package as Apache Spark 1.1.1! The vote is open until Sunday, November 16, at 04:30 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ Cheers, Andrew - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
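The fix being described would look roughly like the following in the test (how HiveQuerySuite stringifies the returned rows is an assumption here):

// Compare the SET output as a Set so the unspecified ordering of the pairs cannot fail the test.
val expected = Set(
  "spark.sql.key.usedfortestonly=test.val.0",
  "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0")
assert(sql("SET").collect().map(_.getString(0)).toSet === expected)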
Re: mvn or sbt for studying and developing Spark?
I'm going to have to disagree here. If you are building a release distribution or integrating with legacy systems then maven is probably the correct choice. However most of the core developers that I know use sbt, and I think its a better choice for exploration and development overall. That said, this probably falls into the category of a religious argument so you might want to look at both options and decide for yourself. In my experience the SBT build is significantly faster with less effort (and I think sbt is still faster even if you go through the extra effort of installing zinc) and easier to read. The console mode of sbt (just run sbt/sbt and then a long running console session is started that will accept further commands) is great for building individual subprojects or running single test suites. In addition to being faster since its a long running JVM, its got a lot of nice features like tab-completion for test case names. For example, if I wanted to see what test cases are available in the SQL subproject you can do the following: [marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt [info] Loading project definition from /Users/marmbrus/workspace/spark/project/project [info] Loading project definition from /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project [info] Set current project to spark-parent (in build file:/Users/marmbrus/workspace/spark/) sql/test-only *tab* -- org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.DataTypeSuite org.apache.spark.sql.DslQuerySuite org.apache.spark.sql.InsertIntoSuite ... Another very useful feature is the development console, which starts an interactive REPL including the most recent version of the code and a lot of useful imports for some subprojects. For example in the hive subproject it automatically sets up a temporary database with a bunch of test data pre-loaded: $ sbt/sbt hive/console hive/console ... import org.apache.spark.sql.hive._ import org.apache.spark.sql.hive.test.TestHive._ import org.apache.spark.sql.parquet.ParquetTestData Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45). Type in expressions to have them evaluated. Type :help for more information. scala sql(SELECT * FROM src).take(2) res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86]) Michael On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody dineshjweerakk...@gmail.com wrote: Hi Stephen and Sean, Thanks for correction. On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote: No, the Maven build is the main one. I would use it unless you have a need to use the SBT build in particular. On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody dineshjweerakk...@gmail.com wrote: Hi Yiming, I believe that both SBT and MVN is supported in SPARK, but SBT is preferred (I'm not 100% sure about this :) ). When I'm using MVN I got some build failures. After that used SBT and works fine. You can go through these discussions regarding SBT vs MVN and learn pros and cons of both [1] [2]. [1] http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html [2] https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ Thanks, On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang sdi...@gmail.com wrote: Hi, I am new in developing Spark and my current focus is about co-scheduling of spark tasks. However, I am confused with the building tools: sometimes the documentation uses mvn but sometimes uses sbt. 
So, my question is that which one is the preferred tool of Spark community? And what's the technical difference between them? Thank you! Cheers, Yiming -- Thanks Best Regards, *Dinesh J. Weerakkody* -- Thanks Best Regards, *Dinesh J. Weerakkody*
Re: mvn or sbt for studying and developing Spark?
* I moved from sbt to maven in June specifically due to Andrew Or's describing mvn as the default build tool. Developers should keep in mind that jenkins uses mvn so we need to run mvn before submitting PR's - even if sbt were used for day to day dev work To be clear, I think that the PR builder actually uses sbt https://github.com/apache/spark/blob/master/dev/run-tests#L198 currently, but there are master builds that make sure maven doesn't break (amongst other things). * In addition, as Sean has alluded to, the Intellij seems to comprehend the maven builds a bit more readily than sbt Yeah, this is a very good point. I have used `sbt/sbt gen-idea` in the past, but I'm currently using the maven integration of inteliJ since it seems more stable. * But for command line and day to day dev purposes: sbt sounds great to use Those sound bites you provided about exposing built-in test databases for hive and for displaying available testcases are sweet. Any easy/convenient way to see more of those kinds of facilities available through sbt ? The Spark SQL developer readme https://github.com/apache/spark/tree/master/sql has a little bit of this, but we really should have some documentation on using SBT as well. Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as legacy, Michael :) Also a good point, though I've seen some pretty clever uses of sbt's external project references to link spark into other projects. I'll certainly admit I have a bias towards new shiny things in general though, so my definition of legacy is probably skewed :)
Re: Creating a SchemaRDD from an existing API
No, it should support any data source that has a schema and can produce rows. On Mon, Dec 1, 2014 at 1:34 AM, Niranda Perera nira...@wso2.com wrote: Hi Michael, About this new data source API, what type of data sources would it support? Does it have to be RDBMS necessarily? Cheers On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust mich...@databricks.com wrote: You probably don't need to create a new kind of SchemaRDD. Instead I'd suggest taking a look at the data sources API that we are adding in Spark 1.2. There is not a ton of documentation, but the test cases show how to implement the various interfaces https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/sources, and there is an example library for reading Avro data https://github.com/databricks/spark-avro. On Thu, Nov 27, 2014 at 10:31 PM, Niranda Perera nira...@wso2.com wrote: Hi, I am evaluating Spark for an analytic component where we do batch processing of data using SQL. So, I am particularly interested in Spark SQL and in creating a SchemaRDD from an existing API [1]. This API exposes elements in a database as datasources. Using the methods allowed by this data source, we can access and edit data. So, I want to create a custom SchemaRDD using the methods and provisions of this API. I tried going through Spark documentation and the Java Docs, but unfortunately, I was unable to come to a final conclusion if this was actually possible. I would like to ask the Spark Devs, 1. As of the current Spark release, can we make a custom SchemaRDD? 2. What is the extension point to a custom SchemaRDD? or are there particular interfaces? 3. Could you please point me the specific docs regarding this matter? Your help in this regard is highly appreciated. Cheers [1] https://github.com/wso2-dev/carbon-analytics/tree/master/components/xanalytics -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 https://twitter.com/N1R44 -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 https://twitter.com/N1R44
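For a feel of the shape of that API, here is a minimal sketch modeled on the linked test cases; names and exact signatures are illustrative rather than authoritative. It exposes a range of integers as a one-column table:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._

// A relation that produces its own rows: anything that can report a schema and build an RDD[Row].
case class IntegerRangeRelation(from: Int, to: Int)(@transient val sqlContext: SQLContext)
  extends TableScan {

  override def schema =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(from to to).map(Row(_))
}

// The provider class is what the data source syntax looks up and instantiates.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    IntegerRangeRelation(parameters("from").toInt, parameters("to").toInt)(sqlContext)
}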
Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe
In Hive 13 (which is the default for Spark 1.2), parquet is included and thus we no longer include the Hive parquet bundle. You can now use the included ParquetSerDe: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe If you want to compile Spark 1.2 with Hive 12 instead you can pass -Phive-0.12.0 and parquet.hive.serde.ParquetHiveSerDe will be included as before. Michael On Tue, Dec 2, 2014 at 9:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Apologies if people get this more than once -- I sent mail to dev@spark last night and don't see it in the archives. Trying the incubator list now...wanted to make sure it doesn't get lost in case it's a bug... -- Forwarded message -- From: Yana Kadiyska yana.kadiy...@gmail.com Date: Mon, Dec 1, 2014 at 8:10 PM Subject: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe To: dev@spark.apache.org Hi all, apologies if this is not a question for the dev list -- figured User list might not be appropriate since I'm having trouble with the RC tag. I just tried deploying the RC and running ThriftServer. I see the following error: 14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException as:anonymous (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found) 14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) I looked at a working installation that I have(build master a few weeks ago) and this class used to be included in spark-assembly: ls *.jar|xargs grep parquet.hive.serde.ParquetHiveSerDe Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar matches but with the RC build it's not there? I tried both the prebuilt CDH drop and later manually built the tag with the following command: ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver $JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar |grep parquet.hive.serde.ParquetHiveSerDe comes back empty...
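For reference, a sketch of what such a table declaration can look like through a HiveContext on the default Hive 0.13 build; the table name, columns, and location are illustrative, and with Hive 0.13 a plain STORED AS PARQUET should resolve to the same classes:

// hiveContext is an org.apache.spark.sql.hive.HiveContext
hiveContext.sql("""
  CREATE EXTERNAL TABLE events (id BIGINT, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  LOCATION '/data/events'
""")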
Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe
Here's a fix: https://github.com/apache/spark/pull/3586 On Wed, Dec 3, 2014 at 11:05 AM, Michael Armbrust mich...@databricks.com wrote: Thanks for reporting. As a workaround you should be able to SET spark.sql.hive.convertMetastoreParquet=false, but I'm going to try to fix this before the next RC. On Wed, Dec 3, 2014 at 7:09 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Thanks Michael, you are correct. I also opened https://issues.apache.org/jira/browse/SPARK-4702 -- if someone can comment on why this might be happening that would be great. This would be a blocker to me using 1.2 and it used to work so I'm a bit puzzled. I was hoping that it's again a result of the default profile switch but it didn't seem to be the case (ps. please advise if this is more user-list appropriate. I'm posting to dev as it's an RC) On Tue, Dec 2, 2014 at 8:37 PM, Michael Armbrust mich...@databricks.com wrote: In Hive 13 (which is the default for Spark 1.2), parquet is included and thus we no longer include the Hive parquet bundle. You can now use the included ParquetSerDe: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe If you want to compile Spark 1.2 with Hive 12 instead you can pass -Phive-0.12.0 and parquet.hive.serde.ParquetHiveSerDe will be included as before. Michael On Tue, Dec 2, 2014 at 9:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Apologies if people get this more than once -- I sent mail to dev@spark last night and don't see it in the archives. Trying the incubator list now...wanted to make sure it doesn't get lost in case it's a bug... -- Forwarded message -- From: Yana Kadiyska yana.kadiy...@gmail.com Date: Mon, Dec 1, 2014 at 8:10 PM Subject: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe To: dev@spark.apache.org Hi all, apologies if this is not a question for the dev list -- figured User list might not be appropriate since I'm having trouble with the RC tag. I just tried deploying the RC and running ThriftServer. 
I see the following error: 14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException as:anonymous (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found) 14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) I looked at a working installation that I have(build master a few weeks ago) and this class used to be included in spark-assembly: ls *.jar|xargs grep parquet.hive.serde.ParquetHiveSerDe Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar matches but with the RC build it's not there? I tried both the prebuilt CDH drop and later manually built the tag with the following command: ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver $JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar |grep parquet.hive.serde.ParquetHiveSerDe comes back empty...
Re: drop table if exists throws exception
The command runs fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an exception saying Hive: NoSuchObjectException(message:table table not found) when running DROP TABLE IF EXISTS table Looks like a new regression in the Hive module. Can anyone confirm this? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0
Thanks for reporting. This looks like a regression related to: https://github.com/apache/spark/pull/2570 I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769 On Fri, Dec 5, 2014 at 12:03 PM, kb kend...@hotmail.com wrote: I am having trouble getting create table as select or saveAsTable from a hiveContext to work with temp tables in spark 1.2. No issues in 1.1.0 or 1.1.1 Simple modification to test case in the hive SQLQuerySuite.scala: test(double nested data) { sparkContext.parallelize(Nested1(Nested2(Nested3(1))) :: Nil).registerTempTable(nested) checkAnswer( sql(SELECT f1.f2.f3 FROM nested), 1) checkAnswer(sql(CREATE TABLE test_ctas_1234 AS SELECT * from nested), Seq.empty[Row]) checkAnswer( sql(SELECT * FROM test_ctas_1234), sql(SELECT * FROM nested).collect().toSeq) } output: 11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not found 'nested' at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:105) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org $scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org $scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at
Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition. On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is after renaming the column names, the value of the columns became all NULL. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Very interesting, the line doing drop table will throws an exception. After removing it all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal characters val rdd = parquetFile(file) val schema = rdd.schema.fields.map(f = s`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}).mkString(,\n) val ddl_13 = s |CREATE EXTERNAL TABLE $name ( | $schema |) |STORED AS PARQUET |LOCATION '$file' .stripMargin sql(ddl_13) 2) create a new Schema and do applySchema to generate a new SchemaRDD, had to drop and register table val t = table(name) val newSchema = StructType(t.schema.fields.map(s = s.copy(name = s.name.replaceAll(.*?::, sql(sdrop table $name) applySchema(t, newSchema).registerTempTable(name) I'm testing it for now. Thanks for the help! Jianshi On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I had to use Pig for some preprocessing and to generate Parquet files for Spark to consume. However, due to Pig's limitation, the generated schema contains Pig's identifier e.g. sorted::id, sorted::cre_ts, ... I tried to put the schema inside CREATE EXTERNAL TABLE, e.g. create external table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work, I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? Thanks, Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0
This is merged now and should be fixed in the next 1.2 RC. On Sat, Dec 6, 2014 at 8:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I've created(reused) the PR https://github.com/apache/spark/pull/3336, hopefully we can fix this regression. Thanks for the reporting. Cheng Hao -Original Message- From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Saturday, December 6, 2014 4:51 AM To: kb Cc: d...@spark.incubator.apache.org; Cheng Hao Subject: Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0 Thanks for reporting. This looks like a regression related to: https://github.com/apache/spark/pull/2570 I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769 On Fri, Dec 5, 2014 at 12:03 PM, kb kend...@hotmail.com wrote: I am having trouble getting create table as select or saveAsTable from a hiveContext to work with temp tables in spark 1.2. No issues in 1.1.0 or 1.1.1 Simple modification to test case in the hive SQLQuerySuite.scala: test(double nested data) { sparkContext.parallelize(Nested1(Nested2(Nested3(1))) :: Nil).registerTempTable(nested) checkAnswer( sql(SELECT f1.f2.f3 FROM nested), 1) checkAnswer(sql(CREATE TABLE test_ctas_1234 AS SELECT * from nested), Seq.empty[Row]) checkAnswer( sql(SELECT * FROM test_ctas_1234), sql(SELECT * FROM nested).collect().toSeq) } output: 11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not found 'nested' at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70) at org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:105) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318
Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
You might also try out the recently added support for views. On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ah... I see. Thanks for pointing it out. Then it means we cannot mount external table using customized column names. hmm... Then the only option left is to use a subquery to add a bunch of column alias. I'll try it later. Thanks, Jianshi On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com wrote: This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition. On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is after renaming the column names, the value of the columns became all NULL. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Very interesting, the line doing drop table will throws an exception. After removing it all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal characters val rdd = parquetFile(file) val schema = rdd.schema.fields.map(f = s`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}).mkString(,\n) val ddl_13 = s |CREATE EXTERNAL TABLE $name ( | $schema |) |STORED AS PARQUET |LOCATION '$file' .stripMargin sql(ddl_13) 2) create a new Schema and do applySchema to generate a new SchemaRDD, had to drop and register table val t = table(name) val newSchema = StructType(t.schema.fields.map(s = s.copy(name = s.name.replaceAll(.*?::, sql(sdrop table $name) applySchema(t, newSchema).registerTempTable(name) I'm testing it for now. Thanks for the help! Jianshi On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I had to use Pig for some preprocessing and to generate Parquet files for Spark to consume. However, due to Pig's limitation, the generated schema contains Pig's identifier e.g. sorted::id, sorted::cre_ts, ... I tried to put the schema inside CREATE EXTERNAL TABLE, e.g. create external table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work, I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? 
Thanks, Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
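To spell out the view-based route suggested at the top of this thread: leave the Parquet-backed table's Pig-generated names untouched and expose clean aliases through a view. The table and sorted:: columns below are the ones from the discussion; the view name is invented.

// hiveContext is an org.apache.spark.sql.hive.HiveContext
hiveContext.sql("""
  CREATE VIEW pmt_clean AS
  SELECT `sorted::id`     AS id,
         `sorted::cre_ts` AS cre_ts
  FROM pmt
""")

hiveContext.sql("SELECT cre_ts FROM pmt_clean LIMIT 1").collect()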
Re: SparkSQL not honoring schema
As the Scaladoc for applySchema says, It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exceptions. We don't check, as doing runtime reflection on all of the data would be very expensive. You will only get errors if you try to manipulate the data; otherwise it will pass it through. I have, however, written some debugging code (developer API, not guaranteed to be stable) that you can use. import org.apache.spark.sql.execution.debug._ schemaRDD.typeCheck() On Wed, Dec 10, 2014 at 6:19 PM, Alessandro Baretta alexbare...@gmail.com wrote: Hello, I defined a SchemaRDD by applying a hand-crafted StructType to an RDD. Some of the Rows in the RDD are malformed--that is, they do not conform to the schema defined by the StructType. When running a select statement on this SchemaRDD I would expect SparkSQL to either reject the malformed rows or fail. Instead, it returns whatever data it finds, even if malformed. Is this the desired behavior? Is there no method in SparkSQL to check for validity with respect to the schema? Thanks. Alex
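A small sketch of how that debugging hook fits together, assuming the usual shell bindings for sc and sqlContext; the schema and rows are invented, and typeCheck is the developer API mentioned above:

import org.apache.spark.sql._
import org.apache.spark.sql.execution.debug._

val schema = StructType(StructField("id", IntegerType, nullable = false) :: Nil)

// The second Row violates the schema, but applySchema will not complain because no
// runtime reflection is done over the data.
val rows = sc.parallelize(Seq(Row(1), Row("oops")))
val schemaRDD = sqlContext.applySchema(rows, schema)

// Eagerly walks the data and reports values that do not match the declared types.
schemaRDD.typeCheck()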
Re: Is there any document to explain how to build the hive jars for spark?
The modified version of Hive can be found here: https://github.com/pwendell/hive On Thu, Dec 11, 2014 at 5:47 PM, Yi Tian tianyi.asiai...@gmail.com wrote: Hi, all We found some bugs in hive-0.12, but we could not wait for the Hive community to fix them. We want to fix these bugs in our lab and build a new release which could be recognized by Spark. As we know, Spark depends on a special release of Hive, like: <dependency> <groupId>org.spark-project.hive</groupId> <artifactId>hive-metastore</artifactId> <version>${hive.version}</version> </dependency> The difference between org.spark-project.hive and org.apache.hive was described by Patrick: "There are two differences: 1. We publish hive with a shaded protobuf dependency to avoid conflicts with some Hadoop versions. 2. We publish a proper hive-exec jar that only includes hive packages. The upstream version of hive-exec bundles a bunch of other random dependencies in it which makes it really hard for third-party projects to use it." Is there any document to guide us on how to build the Hive jars for Spark? Any help would be greatly appreciated.
Re: Data source interface for making multiple tables available for query
I agree and this is something that we have discussed in the past. Essentially I think instead of creating a RelationProvider that returns a single table, we'll have something like an external catalog that can return multiple base relations. On Sun, Dec 21, 2014 at 6:43 PM, Venkata ramana gollamudi ramana.gollam...@huawei.com wrote: Hi, Data source ddl.scala, CREATE TEMPORARY TABLE makes one table at time available to temp tables, how about the case if multiple/all tables from some data source needs to be available for query, just like hive tables. I think we also need that interface to connect such data sources. Please comment. Regards, Ramana
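Purely as a strawman for that discussion, the hook being described might look roughly like the following; nothing of the sort exists in the current API, and the names here are invented:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation

trait ExternalCatalogProvider {
  // Rather than one BaseRelation per CREATE TEMPORARY TABLE statement, hand back every
  // table the external system knows about, keyed by name, so they can all be registered
  // in one step, much as HiveContext picks up everything in the Hive metastore.
  def createCatalog(
      sqlContext: SQLContext,
      parameters: Map[String, String]): Map[String, BaseRelation]
}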
Re: Unsupported Catalyst types in Parquet
I'd love to get both of these in. There is some trickiness that I talk about on the JIRA for timestamps, since the SQL timestamp class can support nanoseconds and I don't think Parquet has a type for this. Other systems (Impala) seem to use INT96. It would be great to maybe ask on the Parquet mailing list what the plan is there, to make sure that whatever we do is going to be compatible long term. Michael On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta alexbare...@gmail.com wrote: Daoyuan, Thanks for creating the JIRAs. I need these features by... last week, so I'd be happy to take care of this myself, if only you or someone more experienced than me in the SparkSQL codebase could provide some guidance. Alex On Dec 29, 2014 12:06 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hi Alex, I'll create JIRA SPARK-4985 for date type support in Parquet, and SPARK-4987 for timestamp type support. For decimal type, I think we only support decimals that fit in a long. Thanks, Daoyuan -Original Message- From: Alessandro Baretta [mailto:alexbare...@gmail.com] Sent: Saturday, December 27, 2014 2:47 PM To: dev@spark.apache.org; Michael Armbrust Subject: Unsupported Catalyst types in Parquet Michael, I'm having trouble storing my SchemaRDDs in Parquet format with SparkSQL, due to my RDDs having DateType and DecimalType fields. What would it take to add Parquet support for these Catalyst types? Are there any other Catalyst types for which there is no Parquet support? Alex
Re: query planner design doc?
No, are you looking for something in particular? On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy halcyo...@gmail.com wrote: Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of strategies are basically embodied in SparkStrategies.scala...is there a design doc/roadmap/JIRA issue detailing what strategies exist and which are planned? Thanks, Nick On Jan 22, 2015, at 7:45 PM, Michael Armbrust mich...@databricks.com wrote: Here is the initial design document for catalyst : https://docs.google.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit Strategies (many of which are in SparkStragegies.scala) are the part that creates the physical operators from a catalyst logical plan. These operators have execute() methods that actually call RDD operations. On Thu, Jan 22, 2015 at 3:19 PM, Nicholas Murphy halcyo...@gmail.com wrote: Hi- Quick question: is there a design doc (or something more than “look at the code”) for the query planner for Spark SQL (i.e., the component that takes…Catalyst?…operator trees and translates them into SPARK operations)? Thanks, Nick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Are there any plans to run Spark on top of Succinct
There was work being done at Berkeley on prototyping support for Succinct in Spark SQL. Rachit might have more information. On Thu, Jan 22, 2015 at 7:04 AM, Dean Wampler deanwamp...@gmail.com wrote: Interesting. I was wondering recently if anyone has explored working with compressed data directly. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Jan 22, 2015 at 2:59 AM, Mick Davies michael.belldav...@gmail.com wrote: http://succinct.cs.berkeley.edu/wp/wordpress/ Looks like a really interesting piece of work that could dovetail well with Spark. I have been trying recently to optimize some queries I have running on Spark on top of Parquet but the support from Parquet for predicate push down especially for dictionary based columns is a bit limiting. I am not sure, but from a cursory view it looks like this format may help in this area. Mick -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Are-there-any-plans-to-run-Spark-on-top-of-Succinct-tp10243.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Optimize encoding/decoding strings when using Parquet
+1 to adding such an optimization to parquet. The bytes are tagged specially as UTF8 in the parquet schema so it seem like it would be possible to add this. On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies michael.belldav...@gmail.com wrote: Hi, It seems that a reasonably large proportion of query time using Spark SQL seems to be spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions as many are duplicated. Details are outlined in the conversation in the user mailing list http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html , I have copied a bit of that discussion here. It seems that as Spark processes each row from Parquet it makes a call to convert the Binary representation for each String column into a Java String. However in many (probably most) circumstances the underlying Binary instance from Parquet will have come from a Dictionary, for example when column cardinality is low. Therefore Spark is converting the same byte array to a copy of the same Java String over and over again. This is bad due to extra cpu, extra memory used for these strings, and probably results in more expensive grouping comparisons. I tested a simple hack to cache the last Binary-String conversion per column in ParquetConverter and this led to a 25% performance improvement for the queries I used. Admittedly this was over a data set with lots or runs of the same Strings in the queried columns. These costs are quite significant for the type of data that I expect will be stored in Parquet which will often have denormalized tables and probably lots of fairly low cardinality string columns I think a good way to optimize this would be if changes could be made to Parquet so that the encoding/decoding of Objects to Binary is handled on Parquet side of fence. Parquet could deal with Objects (Strings) as the client understands them and only use encoding/decoding to store/read from underlying storage medium. Doing this I think Parquet could ensure that the encoding/decoding of each Object occurs only once. Does anyone have an opinion on this, has it been considered already? Cheers Mick -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
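For concreteness, the per-column caching hack described above amounts to something like the following; this is a sketch of the idea, not the actual ParquetConverter change, and it assumes parquet-mr's parquet.io.api.Binary:

import parquet.io.api.Binary

// One instance per string column. Dictionary-encoded pages tend to hand back the same
// Binary instance in long runs, so a cheap reference check skips most of the UTF-8 decoding.
final class CachedStringDecoder {
  private var lastBinary: Binary = null
  private var lastString: String = null

  def decode(value: Binary): String = {
    if (value ne lastBinary) {
      lastBinary = value
      lastString = value.toStringUsingUTF8
    }
    lastString
  }
}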
Re: query planner design doc?
Here is the initial design document for Catalyst: https://docs.google.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit Strategies (many of which are in SparkStrategies.scala) are the parts that create the physical operators from a Catalyst logical plan. These operators have execute() methods that actually call RDD operations. On Thu, Jan 22, 2015 at 3:19 PM, Nicholas Murphy halcyo...@gmail.com wrote: Hi- Quick question: is there a design doc (or something more than “look at the code”) for the query planner for Spark SQL (i.e., the component that takes…Catalyst?…operator trees and translates them into Spark operations)? Thanks, Nick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
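To make the shape of a strategy concrete, here is a toy model with stand-in types rather than the real Catalyst classes (LogicalPlan, SparkPlan, and the strategies in SparkStrategies.scala): a strategy pattern-matches on a logical plan and either emits physical operator candidates or returns Nil so the planner falls through to the next strategy.

// Stand-ins; real physical operators carry an execute() that returns an RDD.
sealed trait LogicalOp
case class LogicalScan(table: String) extends LogicalOp
case class LogicalLimit(n: Int, child: LogicalOp) extends LogicalOp

sealed trait PhysicalOp
case class PhysicalScan(table: String) extends PhysicalOp

trait PlanningStrategy {
  def apply(plan: LogicalOp): Seq[PhysicalOp]
}

object ScanStrategy extends PlanningStrategy {
  def apply(plan: LogicalOp): Seq[PhysicalOp] = plan match {
    case LogicalScan(t) => PhysicalScan(t) :: Nil   // this strategy claims the node
    case _              => Nil                      // let another strategy handle it
  }
}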
Re: HiveContext cannot be serialized
I'd suggest marking the HiveContext as @transient since its not valid to use it on the slaves anyway. On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang hw...@qilinsoft.com wrote: When I'm investigating this issue (in the end of this email), I take a look at HiveContext's code and find this change (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db): - @transient protected[hive] lazy val hiveconf = new HiveConf(classOf[SessionState]) - @transient protected[hive] lazy val sessionState = { -val ss = new SessionState(hiveconf) -setConf(hiveconf.getAllProperties) // Have SQLConf pick up the initial set of HiveConf. -ss - } + @transient protected[hive] lazy val (hiveconf, sessionState) = +Option(SessionState.get()) + .orElse { With the new change, Scala compiler always generate a Tuple2 field of HiveContext as below: private Tuple2 x$3; private transient OutputStream outputBuffer; private transient HiveConf hiveconf; private transient SessionState sessionState; private transient HiveMetastoreCatalog catalog; That x$3 field's key is HiveConf object that cannot be serialized. So can you suggest how to resolve this issue? Thank you very much! I have a streaming application which registered temp table on a HiveContext for each batch duration. The application runs well in Spark 1.1.0. But I get below error from 1.1.1. Do you have any suggestions to resolve it? Thank you! java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf - field (class scala.Tuple2, name: _1, type: class java.lang.Object) - object (class scala.Tuple2, (Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apa che.hadoop.hive.ql.session.SessionState@49b6eef9)) - field (class org.apache.spark.sql.hive.HiveContext, name: x$3, type: class scala.Tuple2) - object (class org.apache.spark.sql.hive.HiveContext, org.apache.spark.sql.hive.HiveContext@4e6e66a4) - field (class example.BaseQueryableDStream$$anonfun$registerTempTable$2, name: sqlContext$1, type: class org.apache.spark.sql.SQLContext) - object (class example.BaseQueryableDStream$$anonfun$registerTempTable$2, function1) - field (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1, name: foreachFunc$1, type: interface scala.Function1) - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1, function2) - field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2) - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20) - element of array (index: 0) - array (class [Ljava.lang.Object;, size: 16) - field (class scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;) - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)) - field (class org.apache.spark.streaming.DStreamGraph, name: outputStreams, type: class scala.collection.mutable.ArrayBuffer) - custom writeObject data (class org.apache.spark.streaming.DStreamGraph) - object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph@776ae7da) - field (class org.apache.spark.streaming.Checkpoint, name: graph, type: class 
org.apache.spark.streaming.DStreamGraph) - root object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint@5eade065) at java.io.ObjectOutputStream.writeObject0(Unknown Source)
Re: HiveContext cannot be serialized
I was suggesting you mark the variable that is holding the HiveContext '@transient' since the scala compiler is not correctly propagating this through the tuple extraction. This is only a workaround. We can also remove the tuple extraction. On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote: Michael - it is already transient. This should probably considered a bug in the scala compiler, but we can easily work around it by removing the use of destructuring binding. On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust mich...@databricks.com wrote: I'd suggest marking the HiveContext as @transient since its not valid to use it on the slaves anyway. On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang hw...@qilinsoft.com wrote: When I'm investigating this issue (in the end of this email), I take a look at HiveContext's code and find this change ( https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db): - @transient protected[hive] lazy val hiveconf = new HiveConf(classOf[SessionState]) - @transient protected[hive] lazy val sessionState = { -val ss = new SessionState(hiveconf) -setConf(hiveconf.getAllProperties) // Have SQLConf pick up the initial set of HiveConf. -ss - } + @transient protected[hive] lazy val (hiveconf, sessionState) = +Option(SessionState.get()) + .orElse { With the new change, Scala compiler always generate a Tuple2 field of HiveContext as below: private Tuple2 x$3; private transient OutputStream outputBuffer; private transient HiveConf hiveconf; private transient SessionState sessionState; private transient HiveMetastoreCatalog catalog; That x$3 field's key is HiveConf object that cannot be serialized. So can you suggest how to resolve this issue? Thank you very much! I have a streaming application which registered temp table on a HiveContext for each batch duration. The application runs well in Spark 1.1.0. But I get below error from 1.1.1. Do you have any suggestions to resolve it? Thank you! 
java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf - field (class scala.Tuple2, name: _1, type: class java.lang.Object) - object (class scala.Tuple2, (Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23 ,org.apa che.hadoop.hive.ql.session.SessionState@49b6eef9)) - field (class org.apache.spark.sql.hive.HiveContext, name: x$3, type: class scala.Tuple2) - object (class org.apache.spark.sql.hive.HiveContext, org.apache.spark.sql.hive.HiveContext@4e6e66a4) - field (class example.BaseQueryableDStream$$anonfun$registerTempTable$2, name: sqlContext$1, type: class org.apache.spark.sql.SQLContext) - object (class example.BaseQueryableDStream$$anonfun$registerTempTable$2, function1) - field (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1, name: foreachFunc$1, type: interface scala.Function1) - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1, function2) - field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2) - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20) - element of array (index: 0) - array (class [Ljava.lang.Object;, size: 16) - field (class scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;) - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20 )) - field (class org.apache.spark.streaming.DStreamGraph, name: outputStreams, type: class scala.collection.mutable.ArrayBuffer) - custom writeObject data (class org.apache.spark.streaming.DStreamGraph) - object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph@776ae7da) - field (class org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph) - root object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint@5eade065) at java.io.ObjectOutputStream.writeObject0(Unknown Source)
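To make the workaround concrete, here is a generic before/after sketch with plain stand-in classes rather than the actual HiveContext fields: the destructuring bind makes scalac synthesize an extra Tuple2 field behind the scenes (the x$3 in the trace above), and on the affected compiler versions that synthetic field does not pick up the @transient annotation.

class Config   // stands in for HiveConf / SessionState, i.e. something non-serializable

class Before extends Serializable {
  // One bind, two names: a hidden tuple field is also generated for the pair.
  @transient lazy val (conf, state) = { val c = new Config; (c, new Config) }
}

class After extends Serializable {
  // Same values, no tuple extraction, so only the two annotated fields exist.
  @transient lazy val conf: Config = new Config
  @transient lazy val state: Config = new Config
}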
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
P.S.: For some reason replacing import sqlContext.createSchemaRDD with import sqlContext.implicits._ doesn't do the implicit conversions. registerTempTable gives a syntax error. I will dig deeper tomorrow. Has anyone seen this? We will write up a whole migration guide before the final release, but I can quickly explain this one. We made the implicit conversion significantly less broad to avoid the chance of confusing conflicts. However, you now have to call .toDF in order to force RDDs to become DataFrames.
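For the record, the 1.3 form looks roughly like this in the shell; the case class and table name are just examples:

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._                    // replaces import sqlContext.createSchemaRDD

val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25))).toDF()
people.registerTempTable("people")               // works once the RDD is explicitly a DataFrame
sqlContext.sql("SELECT name FROM people WHERE age > 28").collect()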
Re: Hive SKEWED feature supported in Spark SQL ?
1) is SKEWED BY honored ? If so, has anyone run into directories not being created ? It is not. 2) if it is not honored, does it matter ? Hive introduced this feature to better handle joins where tables had a skewed distribution on keys joined on so that the single mapper handling one of the keys didn't hold up the whole process. Could that happen in Spark / Spark SQL? It could matter for very skewed data, though I have not heard many complaints. We could consider adding it in the future if people are having problems with skewed data.
Re: renaming SchemaRDD - DataFrame
In particular the performance tricks are in SpecificMutableRow. On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan velvia.git...@gmail.com wrote: Yeah, it's null. I was worried you couldn't represent it in Row because of primitive types like Int (unless you box the Int, which would be a performance hit). Anyways, I'll take another look at the Row API again :-p On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote: Isn't that just null in SQL? On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com wrote: I believe that most DataFrame implementations out there, like Pandas, supports the idea of missing values / NA, and some support the idea of Not Meaningful as well. Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but haven't looked into the details in a while. -Evan On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin r...@databricks.com wrote: It shouldn't change the data source api at all because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously to SchemaRDD). https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala One thing that will break the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types, and now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first class classes/objects defined in sql directly. On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan velvia.git...@gmail.com wrote: Hey guys, How does this impact the data sources API? I was planning on using this for a project. +1 that many things from spark-sql / DataFrame is universally desirable and useful. By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is, at least from previous talks with Reynold and Michael et al., that the format was not designed for persistence. I have a new project that aims to change that. It is a zero-serialisation, high performance binary vector library, designed from the outset to be a persistent storage friendly. May be one day it can replace the Spark SQL columnar compression. Michael told me this would be a lot of work, and recreates parts of Parquet, but I think it's worth it. LMK if you'd like more details. -Evan On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin r...@databricks.com wrote: Alright I have merged the patch ( https://github.com/apache/spark/pull/4173 ) since I don't see any strong opinions against it (as a matter of fact most were for it). We can still change it if somebody lays out a strong argument. On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia matei.zaha...@gmail.com wrote: The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type. Matei On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Reynold, But with type alias we will have the same problem, right? If the methods doesn't receive schemardd anymore, we will have to change our code to migrade from schema to dataframe. Unless we have an implicit conversion between DataFrame and SchemaRDD 2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com: Dirceu, That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame. 
In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala. On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Can't the SchemaRDD remain the same, but deprecated, and be removed in the release 1.5(+/- 1) for example, and the new code been added to DataFrame? With this, we don't impact in existing code for the next few releases. 2015-01-27 0:02 GMT-02:00 Kushal Datta kushal.da...@gmail.com: I want to address the issue that Matei raised about the heavy lifting required for a full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there's a true need for such a system.
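Roughly what that shim would look like from the user's side, per the plan above (a sketch, not the exact Spark source):

// Inside Spark's sql package object, schematically:
//   @deprecated("use DataFrame", "1.3.0")
//   type SchemaRDD = DataFrame
//
// so existing method signatures keep compiling unchanged:
def describe(data: org.apache.spark.sql.SchemaRDD): Unit =
  data.printSchema()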
Re: Caching tables at column level
It's not completely transparent, but you can do something like the following today: CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies michael.belldav...@gmail.com wrote: I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use cache table with this data, but it proves too expensive as it seems to try to cache all the data in the table. For data sets such as the one I am using, you find that certain columns will be hot, referenced frequently in queries, while others will be used very infrequently. Therefore it would be great if caches could be column-based. I realise that this may not be optimal for all use cases, but I think it could be quite a common need. Has something like this been considered? Thanks Mick -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Caching-tables-at-column-level-tp10377.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
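Spelled out, with illustrative table and column names, the suggestion above amounts to caching a projection and pointing queries at it:

// Only the hot columns get pulled into the in-memory columnar cache.
sqlContext.sql("CACHE TABLE hotData AS SELECT user_id, event_time, url FROM fullTable")

// Subsequent queries go against the cached projection...
sqlContext.sql("SELECT url, COUNT(*) FROM hotData GROUP BY url").collect()

// ...and it can be dropped when no longer needed.
sqlContext.sql("UNCACHE TABLE hotData")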
GitHub Syncing Down
FYI: https://issues.apache.org/jira/browse/INFRA-9259
Re: Spark 1.3 SQL Type Parser Changes?
Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filed a JIRA and will investigate if this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote: In Spark 1.2 I used to be able to do this: scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>") res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>") org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``>'' expected but `int' found struct<int:bigint> ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
Re: Any guidance on when to back port and how far?
Two other criteria that I use when deciding what to backport: - Is it a regression from a previous minor release? I'm much more likely to backport fixes in this case, as I'd love for most people to stay up to date. - How scary is the change? I think the primary goal is stability of the maintenance branches. When I am confident that something is isolated and unlikely to break things (i.e. I'm fixing a confusing error message), then i'm much more likely to backport it. Regarding the length of time to continue backporting, I mostly don't backport to N-1, but this is partially because SQL is changing too fast for that to generally be useful. These old branches usually only get attention from me when there is an explicit request. I'd love to hear more feedback from others. Michael On Tue, Mar 24, 2015 at 6:13 AM, Sean Owen so...@cloudera.com wrote: So far, my rule of thumb has been: - Don't back-port new features or improvements in general, only bug fixes - Don't back-port minor bug fixes - Back-port bug fixes that seem important enough to not wait for the next minor release - Back-port site doc changes to the release most likely to go out next, to make it a part of the next site publish But, how far should back-ports go, in general? If the last minor release was 1.N, then to branch 1.N surely. Farther back is a question of expectation for support of past minor releases. Given the pace of change and time available, I assume there's not much support for continuing to use release 1.(N-1) and very little for 1.(N-2). Concretely: does anyone expect a 1.1.2 release ever? a 1.2.2 release? It'd be good to hear the received wisdom explicitly. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: enum-like types in Spark
#4 with a preference for CamelCaseEnums On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com wrote: another vote for #4 People are already used to adding () in Java. On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote: #4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html Constants, Values, Variable and Methods Constant names should be in upper camel case. That is, if the member is final, immutable and it belongs to a package object or an object, it may be considered a constant (similar to Java’sstatic final members): 1. object Container { 2. val MyConstant = ... 3. } 2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com: Hi all, There are many places where we use enum-like types in Spark, but in different ways. Every approach has both pros and cons. I wonder whether there should be an “official” approach for enum-like types in Spark. 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc) * All types show up as Enumeration.Value in Java. http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html 2. Java’s Enum (e.g., SaveMode, IOMode) * Implementation must be in a Java file. * Values doesn’t show up in the ScalaDoc: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode 3. Static fields in Java (e.g., TripletFields) * Implementation must be in a Java file. * Doesn’t need “()” in Java code. * Values don't show up in the ScalaDoc: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields 4. Objects in Scala. (e.g., StorageLevel) * Needs “()” in Java code. * Values show up in both ScalaDoc and JavaDoc: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$ http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html It would be great if we have an “official” approach for this as well as the naming convention for enum-like values (“MEMORY_ONLY” or “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts? Best, Xiangrui - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
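For anyone keeping score, option 4 with CamelCase values looks roughly like this; FetchMode is a made-up example rather than a real Spark type, and the trailing () mentioned as a downside shows up in the Java-friendly accessors:

sealed abstract class FetchMode

object FetchMode {
  case object MemoryOnly extends FetchMode
  case object DiskOnly   extends FetchMode

  // Java callers can write FetchMode.memoryOnly() instead of FetchMode.MemoryOnly$.MODULE$.
  def memoryOnly(): FetchMode = MemoryOnly
  def diskOnly(): FetchMode = DiskOnly
}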
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com wrote: So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the Building Spark page isn't looking too good to me. I would definitely expect this to build and we do actually test that for each PR. We don't yet run the tests for both versions of Hive and thus unfortunately these do get out of sync. Usually these are just problems diff-ing golden output or cases where we have added a test that uses a feature not available in hive 12. Have you seen problems with using Hive 12 outside of these test failures?
Re: [SQL][Feature] Access row by column name instead of index
Already done :) https://github.com/apache/spark/commit/2e8c6ca47df14681c1110f0736234ce76a3eca9b On Fri, Apr 24, 2015 at 2:37 PM, Reynold Xin r...@databricks.com wrote: Can you elaborate what you mean by that? (What's already available in Python?) On Fri, Apr 24, 2015 at 2:24 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I want to ask whether there is a plan to implement the feature to access the Row in SQL by name? Currently we can only access a row by index (there is a Python API for access by name, but none for Java). Regards, Shuai
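For readers following along, a rough sketch of what the linked commit enables (Spark 1.4+; given some DataFrame df, with column names here being hypothetical, and the exact methods depending on your version's Row scaladoc):

import org.apache.spark.sql.Row

val row: Row = df.first()
val id = row.getAs[Long]("id")         // typed access by column name
val name = row.getAs[String]("name")
val nameIdx = row.fieldIndex("name")   // or resolve the index once and reuse it in a tight loop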
Re: Is SQLContext thread-safe?
Unfortunately, I think the SQLParser is not thread-safe. I would recommend using HiveQL. On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote: actually this is a sql parse exception, are you sure your sql is right? Sent from my iPhone. On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote: Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a same SQLContext instance, but the exception below is thrown, so it looks like SQLContext is NOT thread safe? I think this is not the desired behavior. ==
java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier select found
select id ,ext.d from UNIT_TEST
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)
-Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user Subject: Is SQLContext thread-safe? Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!
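As a stopgap for the parser issue described in this thread, one workaround (a sketch only, assuming Spark 1.3.x and the plain SQL dialect) is to serialize just the parsing step while still letting the resulting DataFrames execute concurrently; Michael's suggestion of switching to HiveQL via HiveContext is the other route:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parserLock = new Object

// sql() parses and analyzes the query; guard that part with a lock.
def safeSql(query: String): org.apache.spark.sql.DataFrame =
  parserLock.synchronized { sqlContext.sql(query) }

// Each thread calls safeSql(...); only parsing is serialized, while actions such as
// collect() or count() on the returned DataFrame still run in parallel.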
Re: Uninitialized session in HiveContext?
Hey Marcelo, Thanks for the heads up! I'm currently in the process of refactoring all of this (to separate the metadata connection from the execution side) and as part of this I'm making the initialization of the session not lazy. It would be great to hear if this also works for your internal integration tests once the patch is up (hopefully this weekend). Michael On Thu, Apr 30, 2015 at 2:36 PM, Marcelo Vanzin van...@cloudera.com wrote: Hey all, We ran into some test failures in our internal branch (which builds against Hive 1.1), and I narrowed it down to the fix below. I'm not super familiar with the Hive integration code, but does this look like a bug for other versions of Hive too? This caused an error where some internal Hive configuration that is initialized by the session was not available.

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index dd06b26..6242745 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     if (conf.dialect == "sql") {
       super.sql(substituted)
     } else if (conf.dialect == "hiveql") {
+      // Make sure Hive session state is initialized.
+      if (SessionState.get() != sessionState) {
+        SessionState.start(sessionState)
+      }
       val ddlPlan = ddlParserWithHiveQL.parse(sqlText, exceptionOnError = false)
       DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
     } else {

-- Marcelo
Re: Plans for upgrading Hive dependency?
I am working on it. Here is the (very rough) version: https://github.com/apache/spark/compare/apache:master...marmbrus:multiHiveVersions On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Thanks Marcelo and Patrick - I don't know how I missed that ticket in my Jira search earlier. Is anybody working on the sub-issues yet, or is there a design doc I should look at before taking a stab? Regards, Punya On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell pwend...@gmail.com wrote: Hey Punya, There is some ongoing work to help make Hive upgrades more manageable and allow us to support multiple versions of Hive. Once we do that, it will be much easier for us to upgrade. https://issues.apache.org/jira/browse/SPARK-6906 - Patrick On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin van...@cloudera.com wrote: That's a lot more complicated than you might think. We've done some basic work to get HiveContext to compile against Hive 1.1.0. Here's the code: https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec We didn't send that upstream because that only solves half of the problem; the hive-thriftserver is disabled in our CDH build because it uses a lot of Hive APIs that have been removed in 1.1.0, so even getting it to compile is really complicated. If there's interest in getting the HiveContext part fixed up I can send a PR for that code. But at this time I don't really have plans to look at the thrift server. On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Is there a plan for staying up-to-date with current (and future) versions of Hive? Spark currently supports version 0.13 (June 2014), but the latest version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about updating beyond 0.13, so I was wondering if this was intentional or it was just that nobody had started work on this yet. I'd be happy to work on a PR for the upgrade if one of the core developers can tell me what pitfalls to watch out for. Punya -- Marcelo
Re: DataFrame distinct vs RDD distinct
I'd happily merge a PR that changes the distinct implementation to be more like Spark core, assuming it includes benchmarks that show better performance for both the fits-in-memory case and the too-big-for-memory case. On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Ok, but for the moment, this seems to be killing performance on some computations... I'll try to give you precise figures on this between RDD and DataFrame. Olivier. On Thu, May 7, 2015 at 10:08 AM, Reynold Xin r...@databricks.com wrote: In 1.5, we will most likely just rewrite distinct in SQL to either use the Aggregate operator, which will benefit from all the Tungsten optimizations, or have a Tungsten version of distinct for SQL/DataFrame. On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, there seem to be different implementations of the distinct feature in DataFrames and RDDs, and some performance issue with the DataFrame distinct API. In RDD.scala:

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

And in DataFrame (the Distinct physical operator):

case class Distinct(partial: Boolean, child: SparkPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output

  override def requiredChildDistribution: Seq[Distribution] =
    if (partial) UnspecifiedDistribution :: Nil else ClusteredDistribution(child.output) :: Nil

  override def execute(): RDD[Row] = {
    child.execute().mapPartitions { iter =>
      val hashSet = new scala.collection.mutable.HashSet[Row]()
      var currentRow: Row = null
      while (iter.hasNext) {
        currentRow = iter.next()
        if (!hashSet.contains(currentRow)) {
          hashSet.add(currentRow.copy())
        }
      }
      hashSet.iterator
    }
  }
}

I can try to reproduce more clearly the performance issue, but do you have any insights into why we can't have the same distinct strategy between DataFrame and RDD? Regards, Olivier.
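For anyone who wants to produce the figures Olivier and Michael discuss, a rough benchmark sketch along these lines would compare the two code paths (the data shape and sizes are made up, and sqlContext is assumed to be in scope as in spark-shell; this is not an official harness):

import sqlContext.implicits._

def time[T](label: String)(f: => T): T = {
  val start = System.nanoTime()
  val result = f
  println(s"$label: ${(System.nanoTime() - start) / 1e9}s")
  result
}

// A skewed column so there are many duplicate rows to collapse.
val df = sc.parallelize(1 to 1000000).map(i => i % 1000).toDF("key")
df.cache().count()  // materialize the input so both runs see warmed data

time("DataFrame.distinct")(df.distinct.count())
time("RDD.distinct")(df.rdd.map(_.getInt(0)).distinct().count())  // extract the value to keep RDD-side equality simple

Running each measurement a few times, with inputs both smaller and larger than executor memory, would cover the two cases Michael asks about.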
Re: Speeding up Spark build during development
FWIW... My Spark SQL development workflow is usually to run build/sbt sparkShell or build/sbt 'sql/test-only testSuiteName'. These commands start in as little as 30s on my laptop, automatically figure out which subprojects need to be rebuilt, and don't require the expensive assembly creation. On Mon, May 4, 2015 at 5:48 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, Is it really necessary to run mvn --projects assembly/ -DskipTests install? Could you please explain why this is needed? I got the changes after running mvn --projects streaming/ -DskipTests package. Regards, Meethu On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote: Just to give you an example: When I was trying to make a small change only to the Streaming component of Spark, first I built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files only in Streaming, I ran something like (in the top-level directory): mvn --projects streaming/ -DskipTests package and then mvn --projects assembly/ -DskipTests install This was much faster than trying to build the whole Spark from scratch, because Maven was only building one component, in my case the Streaming component, of Spark. I think you can use a very similar approach. -- Emre Sevinç On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri pramodbilig...@gmail.com wrote: No, I just need to build one project at a time. Right now SparkSql. Pramod On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Pramod, Do you need to build the whole project every time? Generally you don't, e.g., when I was changing some files that belong only to Spark Streaming, I was building only the streaming component (of course after having built and installed the whole project, but that was done only once), and then the assembly. This was much faster than trying to build the whole Spark every time. -- Emre Sevinç On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri pramodbilig...@gmail.com wrote: Using the inbuilt Maven and Zinc it takes around 10 minutes for each build. Is that reasonable? My Maven opts look like this: $ echo $MAVEN_OPTS -Xmx12000m -XX:MaxPermSize=2048m I'm running it as build/mvn -DskipTests package Should I be tweaking my Zinc/Nailgun config? Pramod On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com wrote: https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote: This is great. I didn't know about the mvn script in the build directory. Pramod On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com wrote: Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark you'll get Zinc for free, which should help speed up build times. On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote: Pramod: Please remember to run Zinc so that the build is faster. Cheers On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Pramod, For cluster-like tests you might want to use the same code as in MLlib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class.
Best regards, Alexander -Original Message- From: Pramod Biligiri [mailto:pramodbilig...@gmail.com] Sent: Friday, May 01, 2015 1:46 AM To: dev@spark.apache.org Subject: Speeding up Spark build during development Hi, I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using: mvn -DskipTests package All the nodes have the same filesystem mounted at the same mount point. Pramod -- Emre Sevinc
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
-1 (binding) We were just alerted to a pretty serious regression since 1.3.0 (https://issues.apache.org/jira/browse/SPARK-6851). Should have a fix shortly. Michael On Fri, Apr 10, 2015 at 6:10 AM, Corey Nolet cjno...@gmail.com wrote: +1 (non-binding) - Verified signatures - built on Mac OSX - built on Fedora 21 All builds were done using profiles: hive, hive-thriftserver, hadoop-2.4, yarn +1 tested ML-related items on Mac OS X On Wed, Apr 8, 2015 at 7:59 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 14:16 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11 2. Tested pyspark, MLlib - running as well as compare results with 1.3.0 pyspark works well with the new IPython 3.0.0 release 2.1. statistics (min, max, mean, Pearson, Spearman) OK 2.2. Linear/Ridge/Lasso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK Center And Scale OK 2.5. RDD operations OK State of the Union Texts - MapReduce, Filter, sortByKey (word count) 2.6. Recommendation (MovieLens medium dataset ~1M ratings) OK Model evaluation/optimization (rank, numIter, lambda) with itertools OK 3. Scala - MLlib 3.1. statistics (min, max, mean, Pearson, Spearman) OK 3.2. LinearRegressionWithSGD OK 3.3. Decision Tree OK 3.4. KMeans OK 3.5. Recommendation (MovieLens medium dataset ~1M ratings) OK 4.0. Spark SQL from Python OK 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK On Tue, Apr 7, 2015 at 10:46 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 7c4473a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1083/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/ The patches on top of RC1 are: [SPARK-6737] Fix memory leak in OutputCommitCoordinator https://github.com/apache/spark/pull/5397 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py https://github.com/apache/spark/pull/5302 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError https://github.com/apache/spark/pull/4933 Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/
Re: [Catalyst] RFC: Using PartialFunction literals instead of objects
Overall this seems like a reasonable proposal to me. Here are a few thoughts: - There is some debugging utility to the ruleName, so we would probably want to at least make that an argument to the rule function. - We also have had rules that operate on SparkPlan, though since there is only one ATM maybe we don't need sugar there. - I would not call the sugar for creating Strategies rule/seqrule, as I think the one-to-one vs one-to-many distinction is useful. - I'm generally pro-refactoring to make the code nicer, especially when it's not official public API, but I do think it's important to maintain source compatibility (which I think you are) when possible, as there are other projects using Catalyst. - Finally, we'll have to balance this with other code changes / conflicts. You should probably open a JIRA and we can continue the discussion there. On Tue, May 19, 2015 at 4:16 AM, Edoardo Vacchi uncommonnonse...@gmail.com wrote: Hi everybody, At the moment, Catalyst rules are defined using two different types of rules: `Rule[LogicalPlan]` and `Strategy` (which in turn maps to `GenericStrategy[SparkPlan]`). I propose to introduce utility methods to a) reduce the boilerplate to define rewrite rules and b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Personally, I feel like the current use of objects is redundant and possibly confusing. ## `Rule[LogicalPlan]` The analyzer and optimizer use `Rule[LogicalPlan]`, which, besides defining a default `val ruleName`, only defines the method `apply(plan: TreeType): TreeType`. Because the body of such a method is always supposed to read `plan match pf`, with `pf` being some `PartialFunction[LogicalPlan, LogicalPlan]`, we can conclude that `Rule[LogicalPlan]` might be substituted by a PartialFunction. I propose the following: a) Introduce the utility method

def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
  new Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform pf
  }

b) progressively replace the boilerplate-y object definitions; e.g.

object MyRewriteRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case ... => ...
  }
}

with

// define a Rule[LogicalPlan]
val MyRewriteRule = rule {
  case ... => ...
}

It might also be possible to make the rule method `implicit`, thereby further reducing MyRewriteRule to:

// define a PartialFunction[LogicalPlan, LogicalPlan]
// the implicit would convert it into a Rule[LogicalPlan] at the use sites
val MyRewriteRule = {
  case ... => ...
}

## Strategies A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility methods:

/**
 * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
 * The partial function must therefore return *one single* SparkPlan for each case.
 * The method will automatically wrap them in a [[Seq]].
 * Unhandled cases will automatically return Seq.empty
 */
protected def rule(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] =
    if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
}

/**
 * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ].
 * The partial function must therefore return a Seq[SparkPlan] for each case.
 * Unhandled cases will automatically return Seq.empty
 */
protected def seqrule(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] =
    if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
}

Thanks in advance e.v.
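To illustrate Michael's first point, here is a hedged sketch of the proposed sugar with ruleName kept as an explicit argument (written against the 1.x Catalyst classes; the example rule and its name are hypothetical, not an existing Spark rule):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

def rule(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
  new Rule[LogicalPlan] {
    override val ruleName: String = name  // keeps the debugging-friendly name
    def apply(plan: LogicalPlan): LogicalPlan = plan transform pf
  }

// Usage: a one-liner rule that drops filters whose condition is literally `true`.
val RemoveTrivialFilters = rule("RemoveTrivialFilters") {
  case Filter(Literal(true, BooleanType), child) => child
}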
Re: [SparkSQL 1.4]Could not use concat with UDF in where clause
Can you file a JIRA please? On Tue, Jun 23, 2015 at 1:42 AM, StanZhai m...@zhaishidan.cn wrote: Hi all, After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered the following exception when use concat with UDF in where clause: ===Exception org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org $apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to (TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to (TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:272) at
Re: how to implement my own datasource?
I'd suggest looking at the avro data source as an example implementation: https://github.com/databricks/spark-avro I also gave a talk a while ago: https://www.youtube.com/watch?v=GQSNJAzxOr8 Hi, You can connect by JDBC as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases. Another option is using HadoopRDD and NewHadoopRDD to connect to databases compatible with Hadoop, like HBase; some examples can be found in chapter 5 of Learning Spark https://books.google.es/books?id=tOptBgAAQBAJpg=PT190dq=learning+spark+hadooprddhl=ensa=Xei=4bqLVaDaLsXaU46NgfgLved=0CCoQ6AEwAA#v=onepageq=%20hadooprddf=false For Spark Streaming see the section Custom Sources of https://spark.apache.org/docs/latest/streaming-programming-guide.html Hope that helps. Greetings, Juan 2015-06-25 8:25 GMT+02:00 诺铁 noty...@gmail.com: hi, I can't find documentation about the data source API, or how to implement a custom data source. Any hint is appreciated. Thanks.
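Beyond the avro example above, here is a bare-bones sketch of the data sources API (Spark 1.3+). Everything in it (the package name, the "n" option, the schema) is made up for illustration; spark-avro shows the full treatment, including filter pushdown and column pruning:

package com.example.range  // hypothetical package; this is the name you pass to .format(...)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// The entry point Spark looks for: a class named DefaultSource in the format's package.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new RangeRelation(parameters("n").toInt)(sqlContext)
}

// A relation that produces the integers 0 until n as single-column rows.
class RangeRelation(n: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType = StructType(StructField("value", IntegerType) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}

It would then be loaded with sqlContext.read.format("com.example.range").option("n", "100").load() on Spark 1.4+, or sqlContext.load("com.example.range", Map("n" -> "100")) on 1.3.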
Re: When to expect UTF8String?
1. Custom aggregators that do map-side combine. This is something I'm hoping to add in Spark 1.5. 2. UDFs with more than 22 arguments, which are not supported by ScalaUdf, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number of parameters. I'm super open to suggestions here. Mind possibly opening a JIRA with a proposed interface?
Re: When to expect UTF8String?
Through the DataFrame API, users should never see UTF8String. Expression (and any class in the catalyst package) is considered internal and so uses the internal representation of various types. Which type we use here is not stable across releases. Is there a reason you aren't defining a UDF instead? On Thu, Jun 11, 2015 at 8:08 PM, zsampson zsamp...@palantir.com wrote: I'm hoping for some clarity about when to expect String vs UTF8String when using the Java DataFrames API. In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was once a String is now a UTF8String. The comments in the file and the related commit message indicate that maybe it should be internal to SparkSQL's implementation. However, when I add a column containing a custom subclass of Expression, the row passed to the eval method contains instances of UTF8String. Ditto for AggregateFunction.update. Is this expected? If so, when should I generally know to deal with UTF8String objects? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/When-to-expect-UTF8String-tp12710.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
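To expand on the UDF suggestion above: a UDF declared through the public API is written against external types (String rather than UTF8String), so the internal representation never leaks through. A small sketch, given some DataFrame df, with the column name, table name, and logic made up:

import org.apache.spark.sql.functions.udf

// The function sees plain java.lang.String even though Spark SQL stores UTF8String internally.
val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

df.select(normalize(df("name")).as("normalized_name")).show()

// Or register it for use in SQL strings (assumes a temp table named "people" exists):
sqlContext.udf.register("normalize", (s: String) => if (s == null) null else s.trim.toLowerCase)
sqlContext.sql("SELECT normalize(name) FROM people").show()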
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
It's no longer valid to start more than one instance of HiveContext in a single JVM, as one of the goals of this refactoring was to allow connection to more than one metastore from a single context. For tests I suggest you use TestHive as we do in our unit tests. It has a reset() method you can use to clean up state between tests/suites. We could also add an explicit close() method to remove this restriction, but if that's something you want to investigate we should move this off the vote thread and onto JIRA. On Tue, Jun 2, 2015 at 7:19 AM, Peter Rudenko petro.rude...@gmail.com wrote: Thanks Yin, tried on a clean VM - works now. But tests in my app still fail: [info] Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: -- [info] java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@380628de, see the next exception for details. [info] at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) [info] at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source) [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection40.<init>(Unknown Source) [info] at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source) [info] at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source) [info] at org.apache.derby.jdbc.Driver20.connect(Unknown Source) [info] at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:187) [info] at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361) [info] at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416) [info] at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120) I’ve set parallelExecution in Test := false. Thanks, Peter Rudenko On 2015-06-01 21:10, Yin Huai wrote: Hi Peter, Based on your error message, it seems you were not using RC3. For the error thrown at HiveContext's line 206, we changed the message to this one https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just before RC3. Basically, we will not print out the class loader name. Can you check if an older version of the 1.4 branch got used? Have you published RC3 to your local Maven repo? Can you clean your local repo cache and try again? Thanks, Yin On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com wrote: Still have a problem using HiveContext from sbt.
Here's an example of dependencies:

val sparkVersion = "1.4.0-rc3"

lazy val root = Project(id = "spark-hive", base = file("."),
  settings = Project.defaultSettings ++ Seq(
    name := "spark-1.4-hive",
    scalaVersion := "2.10.5",
    scalaBinaryVersion := "2.10",
    resolvers += "Spark RC" at "https://repository.apache.org/content/repositories/orgapachespark-1110/",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "org.apache.spark" %% "spark-hive" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion
    )
  ))

Launching sbt console with it and running:

val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val data = sc.parallelize(1 to 1)
import sqlContext.implicits._

scala> data.toDF
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to metastore using classloader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set spark.sql.hive.metastore.jars
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:367)
at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
at
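Following Michael's advice above, a rough sketch of how a test suite can lean on TestHive instead of creating its own HiveContext (this assumes ScalaTest and the spark-hive test artifact on the classpath; the suite and table names are illustrative):

import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class MyHiveSuite extends FunSuite with BeforeAndAfterEach {

  test("count rows through the shared TestHive context") {
    import TestHive.implicits._
    val df = TestHive.sparkContext.parallelize(1 to 10).toDF("value")
    df.registerTempTable("numbers")
    assert(TestHive.sql("SELECT count(*) FROM numbers").collect().head.getLong(0) === 10)
  }

  override def afterEach(): Unit = {
    try TestHive.reset()  // drops temp tables and cached data between tests
    finally super.afterEach()
  }
}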
Re: Catalyst: Reusing already computed expressions within a projection
I think this is likely something that we'll want to do during the code generation phase. Though it's probably not the lowest-hanging fruit at this point. On Sun, May 31, 2015 at 5:02 AM, Reynold Xin r...@databricks.com wrote: I think you are looking for http://en.wikipedia.org/wiki/Common_subexpression_elimination in the optimizer. One thing to note is that as we do more and more optimization like this, the optimization time might increase. Do you see a case where this can bring you substantial performance gains? On Sat, May 30, 2015 at 9:02 AM, Justin Uang justin.u...@gmail.com wrote: On second thought, perhaps this can be done by writing a rule that builds the dag of dependencies between expressions, then converts it into several layers of projections, where each new layer is allowed to depend on expression results from previous projections? Are there any pitfalls to this approach? On Sat, May 30, 2015 at 11:30 AM Justin Uang justin.u...@gmail.com wrote: If I do the following

df2 = df.withColumn('y', df['x'] * 7)
df3 = df2.withColumn('z', df2.y * 3)
df3.explain()

Then the result is

Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
 PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163

Effectively I want to compute y = f(x), z = g(y). The Catalyst optimizer realizes that y#64 is the same as the one previously computed; however, when building the projection, it is ignoring the fact that it had already computed y, so it calculates `x * 7` twice: y = x * 7, z = x * 7 * 3. If I wanted to make this fix, would it be possible to do the logic in the optimizer phase? I imagine that it's difficult because the expressions in InterpretedMutableProjection don't have access to the previous expression results, only the input row, and the design doesn't seem to cater for this.
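For readers new to the optimization Reynold links, here is a tiny, self-contained sketch of the idea on a toy expression type (deliberately not Catalyst code; it only illustrates finding the shared `x * 7` subtree that a common-subexpression-elimination rule would compute once):

sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: Double) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

// All subtrees of an expression, including itself.
def subexpressions(e: Expr): Seq[Expr] = e match {
  case Mul(l, r) => e +: (subexpressions(l) ++ subexpressions(r))
  case _ => Seq(e)
}

// Non-trivial subtrees that occur more than once across a projection's expressions:
// these are the candidates a CSE pass would hoist into an earlier projection layer.
def duplicated(exprs: Seq[Expr]): Set[Expr] =
  exprs.flatMap(subexpressions)
    .groupBy(identity)
    .collect { case (e: Mul, occurrences) if occurrences.size > 1 => e }
    .toSet

// Justin's example: y = x * 7 and z = (x * 7) * 3
val y = Mul(Col("x"), Lit(7))
val z = Mul(Mul(Col("x"), Lit(7)), Lit(3))
println(duplicated(Seq(y, z)))  // prints Set(Mul(Col(x),Lit(7.0))): the work to compute once and reference twice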
Re: [SparkSQL 1.4.0]The result of SUM(xxx) in SparkSQL is 0.0 but not null when the column xxx is all null
This was a change that was made to match a wrong answer coming from older versions of Hive. Unfortunately I think it's too late to fix this in the 1.4 branch (as I'd like to avoid changing answers at all in point releases), but in Spark 1.5 we revert to the correct behavior. https://issues.apache.org/jira/browse/SPARK-8828 On Thu, Jul 2, 2015 at 11:58 PM, StanZhai m...@zhaishidan.cn wrote: Hi all, I have a table named test like this:

| a | b    |
| 1 | null |
| 2 | null |

After upgrading the cluster from Spark 1.3.1 to 1.4.0, I found that the SUM function behaves differently in Spark 1.4 and 1.3. The SQL is: select sum(b) from test In Spark 1.4.0 the result is 0.0; in Spark 1.3.1 the result is null. I think the result should be null, so why is the result 0.0 in 1.4.0 rather than null? Is this a bug? Any hint is appreciated. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-1-4-0-The-result-of-SUM-xxx-in-SparkSQL-is-0-0-but-not-null-when-the-column-xxx-is-all-null-tp13008.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
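A minimal way to reproduce Stan's report and check the fix (a sketch only; the DataFrame construction assumes an sqlContext with implicits imported, as in spark-shell, and the table layout mirrors the one above):

import sqlContext.implicits._

val test = Seq((1, None: Option[Double]), (2, None: Option[Double])).toDF("a", "b")
test.registerTempTable("test")

sqlContext.sql("SELECT SUM(b) FROM test").show()
// Spark 1.3.1 and 1.5+ (after SPARK-8828): null
// Spark 1.4.x: 0.0 (the regression discussed above)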