Re: test cases stuck on local-cluster mode of ReplSuite?

2014-03-14 Thread Michael Armbrust
Sorry to revive an old thread, but I just ran into this issue myself.  It
is likely that you do not have the assembly jar built, or that you have
SPARK_HOME set incorrectly (it does not need to be set).

Michael


On Thu, Feb 27, 2014 at 8:13 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 Actually this problem has existed for months on my side: when I run the test
 cases, they stop (actually pause?) at the ReplSuite

 [info] ReplSuite:
 2014-02-27 10:57:37.220 java[3911:1303] Unable to load realm info from
 SCDynamicStore
 [info] - propagation of local properties (7 seconds, 646 milliseconds)
 [info] - simple foreach with accumulator (6 seconds, 204 milliseconds)
 [info] - external vars (4 seconds, 271 milliseconds)
 [info] - external classes (3 seconds, 186 milliseconds)
 [info] - external functions (4 seconds, 843 milliseconds)
 [info] - external functions that access vars (3 seconds, 503 milliseconds)
 [info] - broadcast vars (4 seconds, 313 milliseconds)
 [info] - interacting with files (2 seconds, 492 milliseconds)



 The next test case should be

 test("local-cluster mode") {
   val output = runInterpreter("local-cluster[1,1,512]",
     """
       |var v = 7
       |def getV() = v
       |sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
       |v = 10
       |sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
       |var array = new Array[Int](5)
       |val broadcastArray = sc.broadcast(array)
       |sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
       |array(0) = 5
       |sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
     """.stripMargin)
   assertDoesNotContain("error:", output)
   assertDoesNotContain("Exception", output)
   assertContains("res0: Int = 70", output)
   assertContains("res1: Int = 100", output)
   assertContains("res2: Array[Int] = Array(0, 0, 0, 0, 0)", output)
   assertContains("res4: Array[Int] = Array(0, 0, 0, 0, 0)", output)
 }



 I don't see any reason for it to spend so much time on this test

 Any idea? I'm using a MacBook Pro: OS X 10.9.1, Intel Core i7 2.9 GHz, 8GB
 1600 MHz DDR3 memory

 Best,

 --
 Nan Zhu




Re: new Catalyst/SQL component merged into master

2014-03-20 Thread Michael Armbrust
Hi Everyone,

I'm very excited about merging this new feature into Spark!  We have a lot
of cool things in the pipeline, including: porting Shark's in-memory
columnar format to Spark SQL, code-generation for expression evaluation and
improved support for complex types in parquet.

I would love to hear feedback on the interfaces, and what is missing.  In
particular, while we have pretty good test coverage for Hive, there has not
been a lot of testing with real Hive deployments and there is certainly a
lot more work to do.  So, please test it out and if there are any missing
features let me know!

Michael


On Thu, Mar 20, 2014 at 6:11 PM, Reynold Xin r...@databricks.com wrote:

 Hi All,

 I'm excited to announce a new module in Spark (SPARK-1251). After an
 initial review we've merged this into Spark as an alpha component to be
 included in Spark 1.0. This new component adds some exciting features,
 including:

 - schema-aware RDD programming via an experimental DSL
 - native Parquet support
 - support for executing SQL against RDDs

 The pull request itself contains more information:
 https://github.com/apache/spark/pull/146

 You can also find the documentation for this new component here:
 http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html


 This contribution was led by Michael Armbrust with work from several other
 contributors who I'd like to highlight here: Yin Huai, Cheng Lian, Andre
 Schumacher, Timothy Chen, Henry Cook, and Mark Hamstra.


 - Reynold




Re: new Catalyst/SQL component merged into master

2014-03-21 Thread Michael Armbrust

 It would be great if there are any examples or use cases to look at?

There are examples in the Spark documentation.  Patrick posted an updated
copy here so people can see them before 1.0 is released:
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

 Does this feature have different use cases than Shark, or is it cleaner now
 that the Hive dependency is gone?

Depending on how you use this, there is still a dependency on Hive (By
default this is not the case.  See the above documentation for more
details).  However, the dependency is on a stock version of Hive instead of
one modified by the AMPLab.  Furthermore, Spark SQL has its own optimizer,
instead of relying on the Hive optimizer.  Long term, this is going to give
us a lot more flexibility to optimize queries specifically for the Spark
execution engine.  We are actively porting over the best parts of shark
(specifically the in-memory columnar representation).

Shark still has some features that are missing in Spark SQL, including
SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
status, it'll likely become the new backend for Shark.
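
For reference, here is a minimal sketch (adapted from the linked programming
guide) of using the new component without any Hive dependency.  It assumes an
existing SparkContext `sc`; the case class and table name are made up, and
registerAsTable was later renamed registerTempTable.

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)            // assumes an existing SparkContext `sc`
import sqlContext.createSchemaRDD              // implicit RDD[Person] -> SchemaRDD conversion

val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 15)))
people.registerAsTable("people")               // registerTempTable in later releases

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)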


Making RDDs Covariant

2014-03-21 Thread Michael Armbrust
Hey Everyone,

Here is a pretty major (but source compatible) change we are considering
making to the RDD API for 1.0.  Java and Python APIs would remain the same,
but users of Scala would likely need fewer casts.  This would be
especially true for libraries whose functions take RDDs as parameters.  Any
comments would be appreciated!

https://spark-project.atlassian.net/browse/SPARK-1296
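
For readers unfamiliar with the motivation, a hedged sketch of the kind of
cast the change would remove (the class names are made up):

import org.apache.spark.rdd.RDD

class Vehicle
class Car extends Vehicle

// A library function written against the general element type.
def totalCount(vehicles: RDD[Vehicle]): Long = vehicles.count()

// val cars: RDD[Car] = sc.parallelize(Seq(new Car, new Car))
// totalCount(cars)                              // with invariant RDD[T]: does not compile
// totalCount(cars.asInstanceOf[RDD[Vehicle]])   // today's workaround: an explicit cast
// With a covariant RDD[+T], the first call would compile as written.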

Michael


Re: Making RDDs Covariant

2014-03-22 Thread Michael Armbrust

 From my experience, covariance often becomes a pain when dealing with
 serialization/deserialization (I've experienced a few cases while
 developing play-json and datomisca).
 Moreover, if you have implicits, variance often becomes a headache...


This is exactly the kind of feedback I was hoping for!  Can you be any more
specific about the kinds of problems you ran into here?


Re: Making RDDs Covariant

2014-03-22 Thread Michael Armbrust
Hi Pascal,

Thanks for the input.  I think we are going to be okay here since, as Koert
said, the current serializers use runtime type information.  We could also
keep at ClassTag around for the original type when the RDD was created.
 Good things to be aware of though.

Michael

On Sat, Mar 22, 2014 at 12:42 PM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 On Sat, Mar 22, 2014 at 8:38 PM, David Hall d...@cs.berkeley.edu wrote:

  On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev 
  pascal.voitot@gmail.com wrote:
 
   The problem I was talking about is when you try to use typeclass
  converters
   and make them contravariant/covariant for input/output. Something like:
  
    trait Reader[-I, +O] { def read(i: I): O }
  
   Doing this, you soon have implicit collisions and philosophical
 concerns
   about what it means to serialize/deserialize a Parent class and a Child
   class...
  
 
 
  You should (almost) never make a typeclass param contravariant. It's
 almost
  certainly not what you want:
 
  https://issues.scala-lang.org/browse/SI-2509
 
  -- David
 

 I confirm that it's a pain and I must say I never do it but I've inherited
 historical code that did it :)
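
For readers following along, a small sketch of the pitfall behind SI-2509
(names are made up; exact compiler behaviour varies by version, but the
surprise is that the supertype instance is treated as more specific):

trait Writer[-A] { def write(a: A): String }

class Animal
class Dog extends Animal

object Writers {
  implicit val animalWriter: Writer[Animal] =
    new Writer[Animal] { def write(a: Animal) = "some animal" }
  implicit val dogWriter: Writer[Dog] =
    new Writer[Dog] { def write(d: Dog) = "a dog" }
}

object Demo extends App {
  import Writers._
  def describe[A](a: A)(implicit w: Writer[A]): String = w.write(a)
  // Because Writer is contravariant, Writer[Animal] <: Writer[Dog], so the
  // Animal instance wins implicit resolution instead of the Dog-specific one.
  println(describe(new Dog))  // prints "some animal", not "a dog"
}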



Re: new Catalyst/SQL component merged into master

2014-03-24 Thread Michael Armbrust
Hi Evan,

Index support is definitely something we would like to add, and it is
possible that adding support for your custom indexing solution would not be
too difficult.

We already push predicates into hive table scan operators when the
predicates are over partition keys.  You can see an example of how we
collect filters and decide which can be pushed into the scan using the
HiveTableScan query planning strategy:
https://github.com/marmbrus/spark/blob/0ae86cfcba56b700d8e7bd869379f0c663b21c1e/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L56

I'd like to know more about your indexing solution.  Is this something
publicly available?  One concern here is that the query planning code is
not considered a public API and so is likely to change quite a bit as we
improve the optimizer.  It's not currently something that we plan to expose
for external components to modify.
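
For readers curious about the shape of that logic, a heavily simplified,
hypothetical sketch of the idea (the real code lives in the linked
HiveStrategies and operates on Catalyst expressions):

// Hypothetical, simplified model: a predicate plus the column names it references.
case class Predicate(sql: String, references: Set[String])

// Split filters into those that touch only partition columns (safe to push
// into the table scan as partition pruning) and the rest, which must be
// evaluated after the scan.
def splitPredicates(filters: Seq[Predicate], partitionCols: Set[String])
    : (Seq[Predicate], Seq[Predicate]) =
  filters.partition(_.references.subsetOf(partitionCols))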

Michael


On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan e...@ooyala.com wrote:

 Hi Michael,

 Congrats, this is really neat!

  What thoughts do you have regarding adding indexing support and
  predicate pushdown to this SQL framework? Right now we have custom
  bitmap indexing to speed up queries, so we're really curious about
  the architectural direction.

 -Evan


 On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
 mich...@databricks.com wrote:
 
  It will be great if there are any examples or usecases to look at ?
 
  There are examples in the Spark documentation.  Patrick posted and
 updated
  copy here so people can see them before 1.0 is released:
 
 http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
 
  Does this feature has different usecases than shark or more cleaner as
  hive dependency is gone?
 
  Depending on how you use this, there is still a dependency on Hive (By
  default this is not the case.  See the above documentation for more
  details).  However, the dependency is on a stock version of Hive instead
 of
  one modified by the AMPLab.  Furthermore, Spark SQL has its own
 optimizer,
  instead of relying on the Hive optimizer.  Long term, this is going to
 give
  us a lot more flexibility to optimize queries specifically for the Spark
  execution engine.  We are actively porting over the best parts of shark
  (specifically the in-memory columnar representation).
 
  Shark still has some features that are missing in Spark SQL, including
  SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
  status, it'll likely become the new backend for Shark.



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |



Travis CI

2014-03-25 Thread Michael Armbrust
Just a quick note to everyone that Patrick and I are playing around with
Travis CI on the Spark github repository.  For now, Travis does not run all
of the test cases, so it will only be turned on experimentally.  Long term it
looks like Travis might give better integration with github, so we are
going to see if it is feasible to get all of our tests running on it.

*Jenkins remains the reference CI and should be consulted before merging
pull requests, independent of what Travis says.*

If you have any questions or want to help out with the investigation, let
me know!

Michael


Re: Travis CI

2014-03-29 Thread Michael Armbrust

 Is the migration from Jenkins to Travis finished?


It is not finished and really at this point it is only something we are
considering, not something that will happen for sure.  We turned it on in
addition to Jenkins so that we could start finding issues exactly like the
ones you described below to determine if Travis is going to be a viable
option.

Basically it seems to me that the Travis environment is a little less
predictable (probably because of virtualization), and this is pointing out
some existing flakiness in the tests.

If there are tests that are regularly flaky, we should probably file JIRAs
so they can be fixed or switched off.  If you have seen a test fail 2-3
times and then pass with no changes, I'd say go ahead and file an issue for
it (others should feel free to chime in if we want some other process here).

A few more specific comments inline below.


 2. hive/test usually aborted because it doesn't output anything within 10
 minutes


Hmm, this is a little confusing.  Do you have a pointer to this one?  Was
there any other error?


 4. hive/test didn't finish in 50 minutes, and was aborted


Here I think the right thing to do is probably break the hive tests into two
suites and run them in parallel.  There is already machinery for doing this, we
just need to flip the options on in the travis.yml to make it happen.  This
is only going to get more critical as we whitelist more hive tests.  We
also talked about checking the PR and skipping the hive tests when there
have been no changes in catalyst/sql/hive.  I'm okay with this plan, just
need to find someone with time to implement it.


Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
There is a JIRA for one of the flakey tests here:
https://issues.apache.org/jira/browse/SPARK-1409


On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote:

 TD - do you know what is going on here?

 I looked into this a bit and at least a few of these use
 Thread.sleep() and assume the sleep will be exact, which is wrong. We
 should disable all the tests that do and probably they should be re-written
 to virtualize time.

 - Patrick


 On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  Hi all,
 
  The InputStreamsSuite seems to have some serious flakiness issues -- I've
  seen the file input stream fail many times and now I'm seeing some actor
  input stream test failures (
 
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
  )
  on what I think is an unrelated change.  Does anyone know anything about
  these?  Should we just remove some of these tests since they seem to be
  constantly failing?
 
  -Kay
 



Re: RFC: varargs in Logging.scala?

2014-04-10 Thread Michael Armbrust
Hi Marcelo,

Thanks for bringing this up here, as this has been a topic of debate
recently.  Some thoughts below.

... all of them suffer from the fact that the log message needs to be built
 even though it might not be used.


This is not true of the current implementation (and this is actually why
Spark has a logging trait instead of just using a logger directly.)

If you look at the original function signatures:

protected def logDebug(msg: => String) ...


The => implies that we are passing the msg by name instead of by value.
Under the covers, Scala is creating a closure that can be used to calculate
the log message, only if it's actually required.  This does result in a
significant performance improvement, but still requires allocating an
object for the closure.  The bytecode is really something like this:

val logMessage = new Function0[String] {
  def apply() = "Log message " + someExpensiveComputation()
}
log.debug(logMessage)


In Catalyst and Spark SQL we are using the scala-logging package, which
uses macros to automatically rewrite all of your log statements.

You write: logger.debug(s"Log message $someExpensiveComputation")

You get:

if (logger.debugEnabled) {
  val logMsg = "Log message " + someExpensiveComputation()
  logger.debug(logMsg)
}

IMHO, this is the cleanest option (and is supported by Typesafe).  Based on
a micro-benchmark, it is also the fastest:

std logging: 19885.48ms
spark logging 914.408ms
scala logging 729.779ms

Once the dust settles from the 1.0 release, I'd be in favor of
standardizing on scala-logging.
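
For reference, a minimal sketch of the by-name pattern described above (not
Spark's actual trait, and assuming an SLF4J logger is on the classpath):

import org.slf4j.LoggerFactory

trait SimpleLogging {
  private lazy val log = LoggerFactory.getLogger(getClass)

  // `msg` is by-name: the string (and any expensive computation inside it)
  // is only evaluated if debug logging is actually enabled.
  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)
}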

Michael


Re: RFC: varargs in Logging.scala?

2014-04-10 Thread Michael Armbrust
BTW...

You can do calculations in string interpolation:
s"Time: ${timeMillis / 1000}s"

Or use format strings.
f"Float with two decimal places: $floatValue%.2f"

More info:
http://docs.scala-lang.org/overviews/core/string-interpolation.html
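
Put together as a runnable snippet (variable names are just for illustration):

object InterpolationDemo extends App {
  val timeMillis = 1234L
  val floatValue = 3.14159f
  println(s"Time: ${timeMillis / 1000}s")                     // s-interpolator with an expression
  println(f"Float with two decimal places: $floatValue%.2f")  // f-interpolator with a format spec
}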


On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust mich...@databricks.com wrote:

 Hi Marcelo,

 Thanks for bringing this up here, as this has been a topic of debate
 recently.  Some thoughts below.

 ... all of them suffer from the fact that the log message needs to be built
 even though it might not be used.


 This is not true of the current implementation (and this is actually why
 Spark has a logging trait instead of just using a logger directly.)

 If you look at the original function signatures:

 protected def logDebug(msg: => String) ...


 The => implies that we are passing the msg by name instead of by value.
 Under the covers, Scala is creating a closure that can be used to calculate
 the log message, only if it's actually required.  This does result in a
 significant performance improvement, but still requires allocating an
 object for the closure.  The bytecode is really something like this:

 val logMessage = new Function0[String] {
   def apply() = "Log message " + someExpensiveComputation()
 }
 log.debug(logMessage)


 In Catalyst and Spark SQL we are using the scala-logging package, which
 uses macros to automatically rewrite all of your log statements.

 You write: logger.debug(s"Log message $someExpensiveComputation")

 You get:

 if (logger.debugEnabled) {
   val logMsg = "Log message " + someExpensiveComputation()
   logger.debug(logMsg)
 }

 IMHO, this is the cleanest option (and is supported by Typesafe).  Based
 on a micro-benchmark, it is also the fastest:

 std logging: 19885.48ms
 spark logging 914.408ms
 scala logging 729.779ms

 Once the dust settles from the 1.0 release, I'd be in favor of
 standardizing on scala-logging.

 Michael



Re: Problem creating objects through reflection

2014-04-24 Thread Michael Armbrust
The Spark REPL is slightly modified from the normal Scala REPL to prevent
work from being done twice when closures are deserialized on the workers.
 I'm not sure exactly why this causes your problem, but it's probably worth
filing a JIRA about it.

Here is another issue with classes defined in the REPL.  Not sure if it is
related, but I'd be curious if the workaround helps you:
https://issues.apache.org/jira/browse/SPARK-1199

Michael


On Thu, Apr 24, 2014 at 3:14 AM, Piotr Kołaczkowski
pkola...@datastax.com wrote:

 Hi,

 I'm working on Cassandra-Spark integration and I hit a pretty severe
 problem. One of the provided functionality is mapping Cassandra rows into
 objects of user-defined classes. E.g. like this:

 class MyRow(val key: String, val data: Int)
 sc.cassandraTable("keyspace", "table").select("key", "data").as[MyRow]
 // returns CassandraRDD[MyRow]

 In this example CassandraRDD creates MyRow instances by reflection, i.e.
 matches selected fields from Cassandra table and passes them to the
 constructor.

 Unfortunately this does not work in Spark REPL.
 Turns out any class declared in the REPL is an inner class, and to be
 successfully created, it needs a reference to the outer object, even though
 it doesn't really use anything from the outer context.

 scala> class SomeClass
 defined class SomeClass

 scala> classOf[SomeClass].getConstructors()(0)
 res11: java.lang.reflect.Constructor[_] = public
 $iwC$$iwC$SomeClass($iwC$$iwC)

 I tried passing a null as a temporary workaround, and it also doesn't work
 - I get NPE.
 How can I get a reference to the current outer object representing the
 context of the current line?

 Also, the plain non-Spark Scala REPL doesn't exhibit this behaviour - classes
 declared in the REPL are proper top-level classes, not inner ones.
 Why?

 Thanks,
 Piotr







 --
 Piotr Kolaczkowski, Lead Software Engineer
 pkola...@datastax.com

 777 Mariners Island Blvd., Suite 510
 San Mateo, CA 94404



Re: [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-16 Thread Michael Armbrust
-1

We found a regression in the way configuration is passed to executors.

https://issues.apache.org/jira/browse/SPARK-1864
https://github.com/apache/spark/pull/808

Michael


On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com wrote:

 +1


 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  [Due to ASF e-mail outage, I'm not sure if anyone will actually receive this.]
 
  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
  This has only minor changes on top of rc7.
 
  The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1016/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Monday, May 19, at 10:15 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  changes to ML vector specification:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  changes to the streaming API:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  changes to the GraphX API:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore old behavior
 



Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Thanks for reporting this!

https://issues.apache.org/jira/browse/SPARK-1964
https://github.com/apache/spark/pull/913

If you could test out that PR and see if it fixes your problems I'd really
appreciate it!

Michael


On Thu, May 29, 2014 at 9:09 AM, Andrew Ash and...@andrewash.com wrote:

 I can confirm that the commit is included in the 1.0.0 release candidates
 (it was committed before branch-1.0 split off from master), but I can't
 confirm that it works in PySpark.  Generally the Python and Java interfaces
 lag a little behind the Scala interface to Spark, but we're working to keep
 that diff much smaller going forward.

 Can you try the same thing in Scala?


 On Thu, May 29, 2014 at 8:54 AM, dataginjaninja 
 rickett.stepha...@gmail.com
  wrote:

  Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275
  (https://github.com/apache/spark/pull/275) is included in? I am running
  rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
  when trying to use them in pyspark.
 
  *The error I get:
  *
  14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
  aol
  14/05/29 15:44:48 INFO ParseDriver: Parse Completed
  14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
  thrift:
  14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next
 connection
  attempt.
  14/05/29 15:44:49 INFO metastore: Connected to metastore.
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql
      return self.hiveql(hqlQuery)
    File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql
      return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
    File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
    File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
  py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
  : java.lang.RuntimeException: Unsupported dataType: timestamp
 
  *The table I loaded:*
  DROP TABLE IF EXISTS aol;
  CREATE EXTERNAL TABLE aol (
  userid STRING,
  query STRING,
  query_time TIMESTAMP,
  item_rank INT,
  click_url STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LOCATION '/tmp/data/aol';
 
  *The pyspark commands:*
  from pyspark.sql import HiveContext
  hctx = HiveContext(sc)
  results = hctx.hql("SELECT COUNT(*) FROM aol").collect()
 
 
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 



Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Yes, you'll need to download the code from that PR and reassemble Spark
(sbt/sbt assembly).


On Thu, May 29, 2014 at 10:02 AM, dataginjaninja 
rickett.stepha...@gmail.com wrote:

 Michael,

 Will I have to rebuild after adding the change? Thanks



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
You should be able to get away with only doing it locally.  This bug is
happening during analysis which only occurs on the driver.


On Thu, May 29, 2014 at 10:17 AM, dataginjaninja 
rickett.stepha...@gmail.com wrote:

 Darn, I was hoping just to sneak it into that file. I am not the only person
 working on the cluster; if I rebuild it, that means I have to redeploy
 everything to all the nodes as well.  So I cannot do that ... today. If
 someone else doesn't beat me to it, I can rebuild at another time.



 -
 Cheers,

 Stephanie
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6857.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: Timestamp support in v1.0

2014-06-05 Thread Michael Armbrust
Awesome, thanks for testing!


On Thu, Jun 5, 2014 at 1:30 PM, dataginjaninja rickett.stepha...@gmail.com
wrote:

 I can confirm that the patch fixed my issue. :-)



 -
 Cheers,

 Stephanie
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6948.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: question about Hive compatiblilty tests

2014-06-18 Thread Michael Armbrust
I assume you are adding tests, because that is the only time you should
see that message.

That error could mean a couple of things:
 1) The query is invalid and hive threw an exception
 2) Your Hive setup is bad.

Regarding #2, you need to have the source for Hive 0.12.0 available and
built, as well as a Hadoop installation.  You also have to have the
environment vars set as specified here:
https://github.com/apache/spark/tree/master/sql

Michael


On Thu, Jun 19, 2014 at 12:22 AM, Will Benton wi...@redhat.com wrote:

 Hi all,

 Does a "Failed to generate golden answer for query" message from
 HiveComparisonTests indicate that it isn't possible to run the query in
 question under Hive from Spark's test suite, rather than anything about
 Spark's implementation of HiveQL?  The stack trace I'm getting implicates
 Hive code and not Spark code, but I wanted to make sure I wasn't missing
 something.


 thanks,
 wb



Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-05 Thread Michael Armbrust
+1

I tested sql/hive functionality.


On Sat, Jul 5, 2014 at 9:30 AM, Mark Hamstra m...@clearstorydata.com
wrote:

 +1


 On Fri, Jul 4, 2014 at 12:40 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'll start the voting with a +1 - ran tests on the release candidate
  and ran some basic programs. RC1 passed our performance regression
  suite, and there are no major changes from that RC.
 
  On Fri, Jul 4, 2014 at 12:39 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   Please vote on releasing the following candidate as Apache Spark
 version
  1.0.1!
  
   The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.0.1-rc2/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1021/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
  
   Please vote on releasing this package as Apache Spark 1.0.1!
  
   The vote is open until Monday, July 07, at 20:45 UTC and passes if
   a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.0.1
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   === Differences from RC1 ===
   This release includes only one blocking patch from rc1:
   https://github.com/apache/spark/pull/1255
  
   There are also smaller fixes which came in over the last week.
  
   === About this release ===
   This release fixes a few high-priority bugs in 1.0 and has a variety
   of smaller fixes. The full list is here: http://s.apache.org/b45. Some
   of the more visible patches are:
  
   SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
   SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
  size.
   SPARK-1790: Support r3 instance types on EC2.
  
   This is the first maintenance release on the 1.0 line. We plan to make
   additional maintenance releases as new fixes come in.
 



Re: sparkSQL thread safe?

2014-07-10 Thread Michael Armbrust
Hey Ian,

Thanks for bringing these up!  Responses in-line:

Just wondering if right now spark sql is expected to be thread safe on
 master?
 Doing a simple hadoop file -> RDD -> schema RDD -> write parquet
 will fail in reflection code if I run these in a thread pool.


You are probably hitting SPARK-2178
https://issues.apache.org/jira/browse/SPARK-2178 which is caused by
SI-6240 https://issues.scala-lang.org/browse/SI-6240.  We have a plan to
fix this by moving the schema introspection to compile time, using macros.


 The SparkSqlSerializer seems to create a new Kryo instance each time it
 wants to serialize anything. I got a huge speedup when I had any
 non-primitive type in my SchemaRDD by using the ResourcePools from Chill
 to provide the KryoSerializer to it. (I can open an RB if there is some
 reason not to re-use them?)


Sounds like SPARK-2102 https://issues.apache.org/jira/browse/SPARK-2102.
 There is no reason AFAIK to not reuse the instance. A PR would be greatly
appreciated!
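
As a rough illustration of the reuse Ian is describing (Chill's pooled
serializers are the production-grade version; this only shows the shape of
the fix, with Kryo registration and configuration omitted):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

object ReusedKryo {
  // One Kryo instance per thread instead of one per serialize() call.
  private val kryos = new ThreadLocal[Kryo] {
    override def initialValue(): Kryo = new Kryo()  // register classes here as needed
  }

  def serialize(obj: AnyRef): Array[Byte] = {
    val out = new Output(4096, -1)                  // small initial buffer, unbounded growth
    kryos.get().writeClassAndObject(out, obj)
    out.toBytes
  }
}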


 With the Distinct Count operator there is no map-side operations, and a
 test to check for this. Is there any reason not to do a map side combine
 into a set and then merge the sets later? (similar to the approximate
 distinct count operator)


That's just not an optimization that we had implemented yet... but I've just
done it here (https://github.com/apache/spark/pull/1366) and it'll be in
master soon :)
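
Conceptually it is the same trick one would use at the RDD level; a hedged
sketch (not the Spark SQL operator itself):

import org.apache.spark.SparkContext._   // implicit PairRDDFunctions
import org.apache.spark.rdd.RDD

// Count distinct values per key with a map-side combine: build a Set within
// each partition, merge the sets across partitions, then take their sizes.
def distinctCountPerKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  pairs.combineByKey[Set[Int]](
    (v: Int) => Set(v),
    (s: Set[Int], v: Int) => s + v,
    (s1: Set[Int], s2: Set[Int]) => s1 ++ s2
  ).mapValues(_.size)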


 Another thing while i'm mailing.. the 1.0.1 docs have a section like:
 
 // Note: Case classes in Scala 2.10 can support only up to 22 fields. To
 work around this limit, // you can use custom classes that implement the
 Product interface.
 

  Which sounds great; we have lots of data in Thrift, so via Scrooge (
  https://github.com/twitter/scrooge) we ultimately end up with instances of
  traits which implement Product. Though the reflection code appears to look
  for the constructor of the class and derive the types from those
  parameters?


Yeah, it's true that we only look in the constructor at the moment, but I
don't think there is a really good reason for that (other than I guess we
will need to add code to make sure we skip built-in object methods).  If you
want to open a JIRA, we can try fixing this.

Michael


Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure.  However, the dependency should be very small and
thus easy to remove, and I would like catalyst to be usable outside of
Spark.  A pull request to make this possible would be welcome.

Ideally, we'd create some sort of spark common package that has things like
logging.  That way catalyst could depend on that, without pulling in all of
Hadoop, etc.  Maybe others have opinions though, so I'm cc-ing the dev list.


On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote:

 Making Catalyst independent of Spark is the goal of Catalyst; it may need
 time and evolution.
 I noticed that the package org.apache.spark.sql.catalyst.util
 uses org.apache.spark.util.{Utils => SparkUtils},
 so Catalyst has a dependency on Spark core.
 I'm not sure whether it will be replaced by other component independent of
 Spark in later release.


 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com:

 As per the recent presentation given in Scala days (
 http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it
 was mentioned that Catalyst is independent of Spark. But on inspecting
 pom.xml of sql/catalyst module, it seems it has a dependency on Spark Core.
 Any particular reason for the dependency? I would love to use Catalyst
 outside Spark

 (reposted as previous email bounced. Sorry if this is a duplicate).





Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of
strings when loading / storing data using parquet and Spark SQL.  Before,
Spark SQL did not support binary data in Parquet, so all binary blobs were
implicitly treated as Strings.  Commit 9fe693
(https://github.com/apache/spark/commit/9fe693b5b6ed6af34ee1e800ab89c8a11991ea38)
fixes this limitation by adding support for binary data.

However, data written out with a prior version of Spark SQL will be missing
the annotation telling us to interpret a given column as a String, so old
string data will now be loaded as binary data.  If you would like to use
the data as a string, you will need to add a CAST to convert the datatype.

New string data written out after this change will correctly be loaded in
as a string, as we will now include an annotation about the desired type.
 Additionally, this should now interoperate correctly with other systems
that write Parquet data (hive, thrift, etc).
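
For example, a hedged illustration of reading pre-change data back as strings
(the path and column name are made up, and `sc` is an existing SparkContext):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val oldData = sqlContext.parquetFile("/data/events-written-before-the-change")
oldData.registerTempTable("old_events")

// The 'name' column comes back as binary because the old files lack the
// string annotation; CAST it explicitly to get strings again.
val names = sqlContext.sql("SELECT CAST(name AS STRING) AS name FROM old_events")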

Michael


Re: SQLQuerySuite error

2014-07-24 Thread Michael Armbrust
Thanks for reporting back.  I was pretty confused trying to reproduce the
error :)


On Thu, Jul 24, 2014 at 1:09 PM, Stephen Boesch java...@gmail.com wrote:

 OK I did find my error.  The missing step:

   mvn install

 I should have republished (mvn install) all of the other modules .

 The mvn -pl  will rely on the modules locally published and so the latest
 code that I had git pull'ed was not being used (except  the sql/core module
 code).

 The tests are passing after having properly performed the mvn install
 before  running with the mvn -pl sql/core.




 2014-07-24 12:04 GMT-07:00 Stephen Boesch java...@gmail.com:

 
  Are other developers seeing the following error for the recently added
  substr() method?  If not, any ideas why the following invocation of tests
  would be failing for me - i.e. how the given invocation would need to be
  tweaked?
 
  mvn -Pyarn -Pcdh5 test  -pl sql/core
  -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite
 
  (note cdh5 is a custom profile for cdh5.0.0 but should not be affecting
  these results)
 
  Only the test("SPARK-2407 Added Parser of SQL SUBSTR()") fails; all of
  the other 33 tests pass.
 
  SQLQuerySuite:
  - SPARK-2041 column name equals tablename
  - SPARK-2407 Added Parser of SQL SUBSTR() *** FAILED ***
Exception thrown while executing query:
== Logical Plan ==
java.lang.UnsupportedOperationException
== Optimized Logical Plan ==
java.lang.UnsupportedOperationException
== Physical Plan ==
java.lang.UnsupportedOperationException
== Exception ==
java.lang.UnsupportedOperationException
java.lang.UnsupportedOperationException
at
 
 org.apache.spark.sql.catalyst.analysis.EmptyFunctionRegistry$.lookupFunction(FunctionRegistry.scala:33)
at
 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:131)
at
 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:129)
at
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at
 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
  scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
  scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
  scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to
 (TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
  scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
at
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org
 
 $apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:52)
at
 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:66)
at
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
 



Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Michael Armbrust
That query is looking at "Fix Version", not "Target Version".  The fact that
the first one is still open is only because the bug is not resolved in
master.  It is fixed in 1.0.2.  The second one is partially fixed in 1.0.2,
but is not worth blocking the release for.


On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 TD, there are a couple of unresolved issues slated for 1.0.2
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
 .
 Should they be edited somehow?


 On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das 
 tathagata.das1...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.0.2.
 
  This release fixes a number of bugs in Spark 1.0.1.
  Some of the notable ones are
  - SPARK-2452: Known issue is Spark 1.0.1 caused by attempted fix for
  SPARK-1199. The fix was reverted for 1.0.2.
  - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
  HDFS CSV file.
  The full list is at http://s.apache.org/9NJ
 
  The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1024/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.2!
 
  The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.0.2
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 



Re: new JDBC server test cases seems failed ?

2014-07-27 Thread Michael Armbrust
How recent is this? We've already reverted this patch once due to failing
tests.  It would be helpful to include a link to the failed build.  If it's
failing again we'll have to revert again.


On Sun, Jul 27, 2014 at 5:26 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 It seems that the JDBC test cases are failed unexpectedly in Jenkins?


 [info] - test query execution against a Hive Thrift server *** FAILED ***
 [info] java.sql.SQLException: Could not open connection to
 jdbc:hive2://localhost:45518/: java.net.ConnectException: Connection
 refused [info] at
 org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146)
 [info] at
 org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info]
 at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at
 java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at
 java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
 [info] at
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
 [info] at
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:110)
 [info] at org.apache.spark.sql.hive.thri
 ftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
 [info] at
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
 [info] ... [info] Cause: org.apache.thrift.transport.TTransportException:
 java.net.ConnectException: Connection refused [info] at
 org.apache.thrift.transport.TSocket.open(TSocket.java:185) [info] at
 org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
 [info] at
 org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
 [info] at
 org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
 [info] at
 org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info]
 at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at
 java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at
 java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at
 org.apache.spark.sql.hive.thriftserver.H
 iveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
 [info] at
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
 [info] ... [info] Cause: java.net.ConnectException: Connection refused
 [info] at java.net.PlainSocketImpl.socketConnect(Native Method) [info] at
 java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
 [info] at
 java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
 [info] at
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
 [info] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [info]
 at java.net.Socket.connect(Socket.java:579) [info] at
 org.apache.thrift.transport.TSocket.open(TSocket.java:180) [info] at
 org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
 [info] at
 org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
 [info] at org.apache.hive.jdbc.HiveConn
 ection.openTransport(HiveConnection.java:144) [info] ... [info] CliSuite:
 Executing: create table hive_test1(key int, val string);, expecting output:
 OK [warn] four warnings found [warn] Note:
 /home/jenkins/workspace/SparkPullRequestBuilder@4/core/src/test/java/org/apache/spark/JavaAPISuite.java
 uses or overrides a deprecated API. [warn] Note: Recompile with
 -Xlint:deprecation for details. [info] - simple commands *** FAILED ***
 [info] java.lang.AssertionError: assertion failed: Didn't find OK in the
 output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at
 org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70)
 [info] at
 org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:25)
 [info] at
 org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62)
 [info] at
 org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:25)
 [info] at org.apache.spark.sql.hive.thriftserver.CliSuite
 $$anonfun$1.apply$mcV$sp(CliSuite.scala:53) [info] at
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51)
 [info] at
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51)
 [info] at
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
 [info] at
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
 [log4j:WARN No appenders could be found for logger
 (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j
 system properly. log4j:WARN See
 

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Michael Armbrust
A few things:
 - When we upgrade to Hive 0.13.0, Patrick will likely republish the
hive-exec jar just as we did for 0.12.0
 - Since we have to tie into some pretty low level APIs it is unsurprising
that the code doesn't just compile out of the box against 0.13.0
 - ScalaReflection is for determining Schema from Scala classes, not
reflection based bridge code.  Either way it's unclear if there is any
reason to use reflection to support multiple versions, instead of just
upgrading to Hive 0.13.0

One question I have is: what is the goal of upgrading to Hive 0.13.0?  Is
it purely because you are having problems connecting to newer metastores?
 Are there some features you are hoping for?  This will help me prioritize
this effort.

Michael


On Mon, Jul 28, 2014 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote:

 I was looking for a class where reflection-related code should reside.

 I found this but don't think it is the proper class for bridging
 differences between hive 0.12 and 0.13.1:

 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

 Cheers


 On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote:

  After manually copying hive 0.13.1 jars to local maven repo, I got the
  following errors when building spark-hive_2.10 module :
 
  [ERROR]
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182:
  type mismatch;
   found   : String
   required: Array[String]
  [ERROR]   val proc: CommandProcessor =
  CommandProcessorFactory.get(tokens(0), hiveconf)
  [ERROR]
 ^
  [ERROR]
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60:
  value getAllPartitionsForPruner is not a member of org.apache.
   hadoop.hive.ql.metadata.Hive
  [ERROR] client.getAllPartitionsForPruner(table).toSeq
  [ERROR]^
  [ERROR]
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267:
  overloaded method constructor TableDesc with alternatives:
    (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2:
  Class[_],x$3:
 java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc
  and
()org.apache.hadoop.hive.ql.plan.TableDesc
   cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer],
  Class[(some other)?0(in value tableDesc)(in value tableDesc)],
 Class[?0(in
  value tableDesc)(in   value tableDesc)], java.util.Properties)
  [ERROR]   val tableDesc = new TableDesc(
  [ERROR]   ^
  [WARNING] Class org.antlr.runtime.tree.CommonTree not found - continuing
  with a stub.
  [WARNING] Class org.antlr.runtime.Token not found - continuing with a
 stub.
  [WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with a
  stub.
  [ERROR]
   while compiling:
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
  during phase: typer
   library version: version 2.10.4
  compiler version: version 2.10.4
 
  The above shows incompatible changes between 0.12 and 0.13.1
  e.g. the first error corresponds to the following method
  in CommandProcessorFactory :
public static CommandProcessor get(String[] cmd, HiveConf conf)
 
  Cheers
 
 
  On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com
  wrote:
 
  So, do we have a short-term fix until Hive 0.14 comes out? Perhaps
 adding
  the hive-exec jar to the spark-project repo? It doesn't look like
 there's
  a release date schedule for 0.14.
 
 
 
  On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote:
 
  Exactly, forgot to mention Hulu team also made changes to cope with
 those
  incompatibility issues, but they said that's relatively easy once the
  re-packaging work is done.
  
  
  On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com
 
  wrote:
  
   I've heard from Cloudera that there were hive internal changes
 between
   0.12 and 0.13 that required code re-writing. Over time it might be
   possible for us to integrate with hive using API's that are more
   stable (this is the domain of Michael/Cheng/Yin more than me!). It
   would be interesting to see what the Hulu folks did.
  
   - Patrick
  
   On Mon, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com
   wrote:
 AFAIK, according to a recent talk, the Hulu team in China has built Spark
  SQL
against Hive 0.13 (or 0.13.1?) successfully. Basically they also
re-packaged Hive 0.13 as what the Spark team did. The slides of the
  talk
hasn't been released yet though.
   
   
On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu yuzhih...@gmail.com
 wrote:
   
Owen helped me find this:
https://issues.apache.org/jira/browse/HIVE-7423
   
I guess this means that for Hive 0.14, Spark should be able to
  directly
pull in hive-exec-core.jar
   
Cheers
   
   
On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell 
  pwend...@gmail.com
wrote:
   
 It would be great if the hive team can fix 

Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Michael Armbrust

 It seems that the HiveCompatibilitySuite needs a Hadoop and Hive
 environment, am I right?

 Relative path in absolute URI:
 file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1”


You should only need Hadoop and Hive if you are creating new tests that we
need to compute the answers for.  Existing tests are run with cached
answers.  There are details about the configuration here:
https://github.com/apache/spark/tree/master/sql


Re: Working Formula for Hive 0.13?

2014-08-08 Thread Michael Armbrust
Could you make a PR as described here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark


On Fri, Aug 8, 2014 at 1:57 PM, Zhan Zhang zhaz...@gmail.com wrote:

 Sorry, forget to upload files. I have never posted before :) hive.diff
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/file/n/hive.diff
 



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Working-Formula-for-Hive-0-13-tp7551p.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Working Formula for Hive 0.13?

2014-08-25 Thread Michael Armbrust
Thanks for working on this!  It's unclear at the moment exactly how we are
going to handle this, since the end goal is to be compatible with as many
versions of Hive as possible.  That said, I think it would be great to open
a PR in this case.  Even if we don't merge it, that's a good way to get it
on people's radar and have a discussion about the changes that are required.


On Sun, Aug 24, 2014 at 7:11 PM, scwf wangf...@huawei.com wrote:

   I have worked on a branch that updates the Hive version to hive-0.13 (via
 org.apache.hive): https://github.com/scwf/spark/tree/hive-0.13
 I am wondering whether it's OK to make a PR now, because the hive-0.13 version
 is not compatible with hive-0.12 and here I used org.apache.hive.



 On 2014/7/29 8:22, Michael Armbrust wrote:

 A few things:
   - When we upgrade to Hive 0.13.0, Patrick will likely republish the
 hive-exec jar just as we did for 0.12.0
   - Since we have to tie into some pretty low level APIs it is
 unsurprising
 that the code doesn't just compile out of the box against 0.13.0
   - ScalaReflection is for determining Schema from Scala classes, not
 reflection based bridge code.  Either way it's unclear if there is any
 reason to use reflection to support multiple versions, instead of just
 upgrading to Hive 0.13.0

 One question I have is: what is the goal of upgrading to Hive 0.13.0?  Is
 it purely because you are having problems connecting to newer metastores?
   Are there some features you are hoping for?  This will help me
 prioritize
 this effort.

 Michael


 On Mon, Jul 28, 2014 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote:

  I was looking for a class where reflection-related code should reside.

 I found this but don't think it is the proper class for bridging
 differences between hive 0.12 and 0.13.1:

 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/
 ScalaReflection.scala

 Cheers


 On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote:

  After manually copying hive 0.13.1 jars to local maven repo, I got the
 following errors when building spark-hive_2.10 module :

 [ERROR]

  /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/
 sql/hive/HiveContext.scala:182:

 type mismatch;
   found   : String
   required: Array[String]
 [ERROR]   val proc: CommandProcessor =
 CommandProcessorFactory.get(tokens(0), hiveconf)
 [ERROR]
 ^
 [ERROR]

  /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/
 sql/hive/HiveMetastoreCatalog.scala:60:

 value getAllPartitionsForPruner is not a member of org.apache.
   hadoop.hive.ql.metadata.Hive
 [ERROR] client.getAllPartitionsForPruner(table).toSeq
 [ERROR]^
 [ERROR]

  /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/
 sql/hive/HiveMetastoreCatalog.scala:267:

 overloaded method constructor TableDesc with alternatives:
 (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2:
 Class[_],x$3:

 java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc

 and
()org.apache.hadoop.hive.ql.plan.TableDesc
   cannot be applied to (Class[org.apache.hadoop.hive.
 serde2.Deserializer],
 Class[(some other)?0(in value tableDesc)(in value tableDesc)],

 Class[?0(in

 value tableDesc)(in   value tableDesc)], java.util.Properties)
 [ERROR]   val tableDesc = new TableDesc(
 [ERROR]   ^
 [WARNING] Class org.antlr.runtime.tree.CommonTree not found -
 continuing
 with a stub.
 [WARNING] Class org.antlr.runtime.Token not found - continuing with a

 stub.

 [WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with
 a
 stub.
 [ERROR]
   while compiling:

  /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/
 sql/hive/HiveQl.scala

  during phase: typer
   library version: version 2.10.4
  compiler version: version 2.10.4

 The above shows incompatible changes between 0.12 and 0.13.1
 e.g. the first error corresponds to the following method
 in CommandProcessorFactory :
public static CommandProcessor get(String[] cmd, HiveConf conf)

 Cheers


 On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com
 wrote:

  So, do we have a short-term fix until Hive 0.14 comes out? Perhaps

 adding

 the hive-exec jar to the spark-project repo? It doesn't look like

 there's

 a release date schedule for 0.14.



 On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote:

  Exactly, forgot to mention Hulu team also made changes to cope with

 those

 incompatibility issues, but they said that's relatively easy once the
 re-packaging work is done.


 On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com


  wrote:

  I've heard from Cloudera that there were hive internal changes

 between

  0.12 and 0.13 that required code re-writing. Over time it might be
 possible for us to integrate with hive using API's that are more
 stable (this is the domain of Michael/Cheng/Yin more than me!). It
 would be interesting to see what the Hulu folks did.

 - Patrick

 On Mon, Jul 28, 2014

Re: Storage Handlers in Spark SQL

2014-08-25 Thread Michael Armbrust
- dev list
+ user list

You should be able to query Spark SQL using JDBC, starting with the 1.1
release.  There is some documentation in the repo
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server,
and we'll update the official docs once the release is out.
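
For a quick sanity check once the Thrift server is up, any HiveServer2 JDBC
client should work; with the bundled beeline it looks roughly like this (the
host/port below are just the defaults, adjust for your deployment):

  ./sbin/start-thriftserver.sh

  ./bin/beeline
  beeline> !connect jdbc:hive2://localhost:10000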


On Thu, Aug 21, 2014 at 4:43 AM, Niranda Perera nira...@wso2.com wrote:

 Hi,

 I have been playing around with Spark for the past few days, and evaluating
 the possibility of migrating into Spark (Spark SQL) from Hive/Hadoop.

 I am working on the WSO2 Business Activity Monitor (WSO2 BAM,

 https://docs.wso2.com/display/BAM241/WSO2+Business+Activity+Monitor+Documentation
 ) which has currently employed Hive. We are considering Spark as a
 successor for Hive, given it's performance enhancement.

 We have currently employed several custom storage-handlers in Hive.
 Example:
 WSO2 JDBC and Cassandra storage handlers:
 https://docs.wso2.com/display/BAM241/JDBC+Storage+Handler+for+Hive

 https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-cas

 I would like to know where Spark SQL can work with these storage
 handlers (while using HiveContext may be) ?

 Best regards
 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44



Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Michael Armbrust

 Any initial proposal or design about the caching to Tachyon that you
 can share so far?


Caching parquet files in Tachyon with saveAsParquetFile and then reading
them with parquetFile should already work. You can use SQL on these tables
by using registerTempTable.
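
For example, something along these lines should do it (the Tachyon URI and
table name here are made up, and 'data' is assumed to be an existing
SchemaRDD):

  // Write the data out to Tachyon as parquet, then read it back and query it.
  data.saveAsParquetFile("tachyon://master:19998/tables/events")

  val events = sqlContext.parquetFile("tachyon://master:19998/tables/events")
  events.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events").collect()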

Some of the general parquet work that we have been doing includes: #1935
https://github.com/apache/spark/pull/1935, SPARK-2721
https://issues.apache.org/jira/browse/SPARK-2721, SPARK-3036
https://issues.apache.org/jira/browse/SPARK-3036, SPARK-3037
https://issues.apache.org/jira/browse/SPARK-3037 and #1819
https://github.com/apache/spark/pull/1819

The reason I'm asking about the columnar compressed format is that
 there are some problems for which Parquet is not practical.


Can you elaborate?


Re: CoHadoop Papers

2014-08-26 Thread Michael Armbrust
It seems like there are two things here:
 - Co-locating blocks with the same keys to avoid network transfer.
 - Leveraging partitioning information to avoid a shuffle when data is
already partitioned correctly (even if those partitions aren't yet on the
same machine).

The former seems more complicated and probably requires the support from
Hadoop you linked to.  However, the latter might be easier as there is
already a framework for reasoning about partitioning and the need to
shuffle in the Spark SQL planner.


On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com wrote:

 Christopher, can you expand on the co-partitioning support?

 We have a number of spark SQL tables (saved in parquet format) that all
 could be considered to have a common hash key.  Our analytics team wants to
 do frequent joins across these different data-sets based on this key.  It
 makes sense that if the data for each key across 'tables' was co-located on
 the same server, shuffles could be minimized and ultimately performance
 could be much better.

 From reading the HDFS issue I posted before, the way is being paved for
 implementing this type of behavior though there are a lot of complications
 to make it work I believe.


 On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com
 wrote:

  Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
 
  If the former, Spark does support copartitioning.
 
  If the latter, it's an HDFS scope that's outside of Spark. On that note,
  Hadoop does also make attempts to collocate data, e.g., rack awareness.
 I'm
  sure the paper makes useful contributions for its set of use cases.
 
  Sent while mobile. Pls excuse typos etc.
  On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:
 
  It appears support for this type of control over block placement is
 going
  out in the next version of HDFS:
  https://issues.apache.org/jira/browse/HDFS-2576
 
 
  On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com
  wrote:
 
   One of my colleagues has been questioning me as to why Spark/HDFS
 makes
  no
   attempts to try to co-locate related data blocks.  He pointed to this
   paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on
  the
   CoHadoop research and the performance improvements it yielded for
   Map/Reduce jobs.
  
   Would leveraging these ideas for writing data from Spark make sense/be
   worthwhile?
  
  
  
 
 



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Michael Armbrust
+1


On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 +1

 Tested on Mac OS X.

 Matei

 On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote:

 +1

 Verified PySpark InputFormat/OutputFormat examples.


 On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin r...@databricks.com wrote:

  +1
 
 
  On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com
 wrote:
 
   +1
  
   - Tested Thrift server and SQL CLI locally on OSX 10.9.
   - Checked datanucleus dependencies in distribution tarball built by
   make-distribution.sh without SPARK_HIVE defined.
  
   ​
  
  
   On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote:
  
+1
   
Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9
 (Oracle
   JDK
8).
   
   
best,
wb
   
   
- Original Message -
 From: Patrick Wendell pwend...@gmail.com
 To: dev@spark.apache.org
 Sent: Saturday, August 30, 2014 5:07:52 PM
 Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)

 Please vote on releasing the following candidate as Apache Spark
   version
 1.1.0!

 The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):

   
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

 The release files, including signatures, digests, etc. can be found
  at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:

  
 https://repository.apache.org/content/repositories/orgapachespark-1030/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

 Please vote on releasing this package as Apache Spark 1.1.0!

 The vote is open until Tuesday, September 02, at 23:07 UTC and
 passes
   if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.1.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == Regressions fixed since RC1 ==
 - Build issue for SQL support:
 https://issues.apache.org/jira/browse/SPARK-3234
 - EC2 script version bump to 1.1.0.

 == What justifies a -1 vote for this release? ==
 This vote is happening very late into the QA period compared with
 previous votes, so -1 votes should only occur for significant
 regressions from 1.0.2. Bugs already present in 1.0.X will not
 block
 this release.

 == What default changes should I be aware of? ==
 1. The default value of spark.io.compression.codec is now
 snappy
 -- Old behavior can be restored by switching to lzf

 2. PySpark now performs external spilling during aggregations.
 -- Old behavior can be restored by setting spark.shuffle.spill
 to
false.


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


   
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  
 



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Michael Armbrust
+1


On Wed, Sep 3, 2014 at 12:29 AM, Reynold Xin r...@databricks.com wrote:

 +1

 Tested locally on Mac OS X with local-cluster mode.




 On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'll kick it off with a +1
 
  On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   Please vote on releasing the following candidate as Apache Spark
 version
  1.1.0!
  
   The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.1.0-rc4/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1031/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/
  
   Please vote on releasing this package as Apache Spark 1.1.0!
  
   The vote is open until Saturday, September 06, at 08:30 UTC and passes
 if
   a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.1.0
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   == Regressions fixed since RC3 ==
   SPARK-3332 - Issue with tagging in EC2 scripts
   SPARK-3358 - Issue with regression for m3.XX instances
  
   == What justifies a -1 vote for this release? ==
   This vote is happening very late into the QA period compared with
   previous votes, so -1 votes should only occur for significant
   regressions from 1.0.2. Bugs already present in 1.0.X will not block
   this release.
  
   == What default changes should I be aware of? ==
   1. The default value of spark.io.compression.codec is now snappy
   -- Old behavior can be restored by switching to lzf
  
   2. PySpark now performs external spilling during aggregations.
   -- Old behavior can be restored by setting spark.shuffle.spill to
  false.
  
   3. PySpark uses a new heuristic for determining the parallelism of
   shuffle operations.
   -- Old behavior can be restored by setting
   spark.default.parallelism to the number of cores in the cluster.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: trimming unnecessary test output

2014-09-07 Thread Michael Armbrust
Feel free to submit a PR to add a log4j.properties file to
sql/catalyst/src/test/resources, similar to what we do in core/hive.
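
Something like the following should redirect the output to a file during
tests (just a sketch -- mirror whatever core actually uses; the appender
settings and log path here are an approximation):

  # sql/catalyst/src/test/resources/log4j.properties
  # Send test logging to a file instead of the console.
  log4j.rootCategory=INFO, file
  log4j.appender.file=org.apache.log4j.FileAppender
  log4j.appender.file.append=false
  log4j.appender.file.file=target/unit-tests.log
  log4j.appender.file.layout=org.apache.log4j.PatternLayout
  log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %p %c{1}: %m%n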


On Sat, Sep 6, 2014 at 2:50 PM, Sean Owen so...@cloudera.com wrote:

 This is just a line logging that one test succeeded right? I don't find
 that noise. Recently I wanted to search test run logs for a test case
 success and it was important that the individual test case was logged.
 On Sep 6, 2014 4:13 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

  Continuing the discussion started here
  https://github.com/apache/spark/pull/2279, I’m wondering if people
  already know that certain test output is unnecessary and should be
 trimmed.
 
  For example
  
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19917/consoleFull
  ,
  I see a bunch of lines like this:
 
  14/09/06 07:54:13 INFO GenerateProjection: Code generated expression
  List(IS NOT NULL 1) in 128.33733 ms
 
  Can/should this type of output be suppressed? Is there any other test
  output that is obviously more noise than signal?
 
  Nick
  ​
 



Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote:

 Is there a reason in general not to push projections and predicates down
 into the individual ParquetTableScans in a union?


This would be a great case to add to ColumnPruning.  Would be awesome if
you could open a JIRA or even a PR :)


Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
Thanks!

On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote:

 Opened

 https://issues.apache.org/jira/browse/SPARK-3462

 I'll take a look at ColumnPruning and see what I can do

 On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com
 wrote:

 On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org
 wrote:

 Is there a reason in general not to push projections and predicates down
 into the individual ParquetTableScans in a union?


 This would be a great case to add to ColumnPruning.  Would be awesome if
 you could open a JIRA or even a PR :)





Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
I think usually people add these directories as multiple partitions of the
same table instead of union.  This actually allows us to efficiently prune
directories when reading in addition to standard column pruning.
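
Concretely, instead of unioning one table per directory, you'd have a single
partitioned table whose partitions point at the existing directories, e.g. (a
sketch -- names, paths and partition values are made up, and the SerDe/format
class names assume the Hive 0.12-era parquet bindings, so they may differ in
your setup):

  CREATE EXTERNAL TABLE foo (name STRING, age INT)
  PARTITIONED BY (day STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';

  ALTER TABLE foo ADD PARTITION (day='2014-09-08') LOCATION '/foo/d1';
  ALTER TABLE foo ADD PARTITION (day='2014-09-09') LOCATION '/foo/d2';

  -- A predicate on the partition column now prunes whole directories:
  SELECT name FROM foo WHERE day = '2014-09-09';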

On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf malouf.g...@gmail.com wrote:

 I'm kind of surprised this was not run into before.  Do people not
 segregate their data by day/week in the HDFS directory structure?


 On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Thanks!

 On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org
 wrote:

  Opened
 
  https://issues.apache.org/jira/browse/SPARK-3462
 
  I'll take a look at ColumnPruning and see what I can do
 
  On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust 
 mich...@databricks.com
  wrote:
 
  On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org
  wrote:
 
  Is there a reason in general not to push projections and predicates
 down
  into the individual ParquetTableScans in a union?
 
 
  This would be a great case to add to ColumnPruning.  Would be awesome
 if
  you could open a JIRA or even a PR :)
 
 
 





Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
What Patrick said is correct.  Two other points:
 - In the 1.2 release we are hoping to beef up the support for working with
partitioned parquet independent of the metastore.
 - You can actually do operations like INSERT INTO for parquet tables to
add data.  This creates new parquet files for each insertion.  This will
break if there are multiple concurrent writers to the same table.
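
For example (a sketch -- paths and table names are made up):

  // Register an existing parquet directory as a table, then append another
  // batch to it.  Each insert writes new part-files under /data/events, so
  // make sure only one writer does this at a time.
  val events = sqlContext.parquetFile("/data/events")
  events.registerTempTable("events")

  sqlContext.parquetFile("/data/new_batch").insertInto("events")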

On Tue, Sep 9, 2014 at 12:09 PM, Patrick Wendell pwend...@gmail.com wrote:

 I think what Michael means is people often use this to read existing
 partitioned Parquet tables that are defined in a Hive metastore rather
 than data generated directly from within Spark and then reading it
 back as a table. I'd expect the latter case to become more common, but
 for now most users connect to an existing metastore.

 I think you could go this route by creating a partitioned external
 table based on the on-disk layout you create. The downside is that
 you'd have to go through a hive metastore whereas what you are doing
 now doesn't need hive at all.

 We should also just fix the case you are mentioning where a union is
 used directly from within spark. But that's the context.

 - Patrick

 On Tue, Sep 9, 2014 at 12:01 PM, Cody Koeninger c...@koeninger.org
 wrote:
  Maybe I'm missing something, I thought parquet was generally a write-once
  format and the sqlContext interface to it seems that way as well.
 
  d1.saveAsParquetFile(/foo/d1)
 
  // another day, another table, with same schema
  d2.saveAsParquetFile(/foo/d2)
 
  Will give a directory structure like
 
  /foo/d1/_metadata
  /foo/d1/part-r-1.parquet
  /foo/d1/part-r-2.parquet
  /foo/d1/_SUCCESS
 
  /foo/d2/_metadata
  /foo/d2/part-r-1.parquet
  /foo/d2/part-r-2.parquet
  /foo/d2/_SUCCESS
 
  // ParquetFileReader will fail, because /foo/d1 is a directory, not a
  parquet partition
  sqlContext.parquetFile(/foo)
 
  // works, but has the noted lack of pushdown
 
 sqlContext.parquetFile(/foo/d1).unionAll(sqlContext.parquetFile(/foo/d2))
 
 
  Is there another alternative?
 
 
 
  On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com
 
  wrote:
 
  I think usually people add these directories as multiple partitions of
 the
  same table instead of union.  This actually allows us to efficiently
 prune
  directories when reading in addition to standard column pruning.
 
  On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf malouf.g...@gmail.com
  wrote:
 
  I'm kind of surprised this was not run into before.  Do people not
  segregate their data by day/week in the HDFS directory structure?
 
 
  On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust 
 mich...@databricks.com
  wrote:
 
  Thanks!
 
  On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org
  wrote:
 
   Opened
  
   https://issues.apache.org/jira/browse/SPARK-3462
  
   I'll take a look at ColumnPruning and see what I can do
  
   On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust 
  mich...@databricks.com
   wrote:
  
   On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger 
 c...@koeninger.org
   wrote:
  
   Is there a reason in general not to push projections and
 predicates
  down
   into the individual ParquetTableScans in a union?
  
  
   This would be a great case to add to ColumnPruning.  Would be
 awesome
  if
   you could open a JIRA or even a PR :)
  
  
  
 
 
 
 



Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Michael Armbrust
Hey Cody,

Thanks for doing this!  Will look at your PR later today.

Michael

On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger c...@koeninger.org wrote:

 Tested the patch against a cluster with some real data.  Initial results
 seem like going from one table to a union of 2 tables is now closer to a
 doubling of query time as expected, instead of 5 to 10x.

 Let me know if you see any issues with that PR.

 On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninger c...@koeninger.org
 wrote:

 So the obvious thing I was missing is that the analyzer has already
 resolved attributes by the time the optimizer runs, so the references in
 the filter / projection need to be fixed up to match the children.

 Created a PR, let me know if there's a better way to do it.  I'll see
 about testing performance against some actual data sets.

 On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Ok, so looking at the optimizer code for the first time and trying the
 simplest rule that could possibly work,

 object UnionPushdown extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     // Push down filter into union
     case f @ Filter(condition, u @ Union(left, right)) =>
       u.copy(left = f.copy(child = left), right = f.copy(child = right))

     // Push down projection into union
     case p @ Project(projectList, u @ Union(left, right)) =>
       u.copy(left = p.copy(child = left), right = p.copy(child = right))
   }
 }


 If I try manually applying that rule to a logical plan in the repl, it
 produces the query shape I'd expect, and executing that plan results in
 parquet pushdowns as I'd expect.

 But adding those cases to ColumnPruning results in a runtime exception
 (below)

 I can keep digging, but it seems like I'm missing some obvious initial
 context around naming of attributes.  If you can provide any pointers to
 speed me on my way I'd appreciate it.


 java.lang.AssertionError: assertion failed: ArrayBuffer() +
 ArrayBuffer() != WrappedArray(name#6, age#7), List(name#9, age#10,
 phones#11)
 at scala.Predef$.assert(Predef.scala:179)
 at
 org.apache.spark.sql.parquet.ParquetTableScan.init(ParquetTableOperations.scala:75)
 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234)
 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234)
 at
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:367)
 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:230)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
 at
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282)
 at
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:282)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:431)




 On Tue, Sep 9, 2014 at 3:02 PM, Michael Armbrust mich...@databricks.com
  wrote:

 What Patrick said is correct.  Two other points:
  - In the 1.2 release we are hoping to beef up the support

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Michael Armbrust
Yeah, thanks for implementing it!

Since Spark SQL is an alpha component and moving quickly, the plan is to
backport all of master into the next point release in the 1.1 series.

On Fri, Sep 12, 2014 at 9:27 AM, Cody Koeninger c...@koeninger.org wrote:

 Cool, thanks for your help on this.  Any chance of adding it to the 1.1.1
 point release, assuming there ends up being one?

 On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust mich...@databricks.com
  wrote:

 Hey Cody,

 Thanks for doing this!  Will look at your PR later today.

 Michael

 On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger c...@koeninger.org
 wrote:

 Tested the patch against a cluster with some real data.  Initial results
 seem like going from one table to a union of 2 tables is now closer to a
 doubling of query time as expected, instead of 5 to 10x.

 Let me know if you see any issues with that PR.

 On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninger c...@koeninger.org
 wrote:

 So the obvious thing I was missing is that the analyzer has already
 resolved attributes by the time the optimizer runs, so the references in
 the filter / projection need to be fixed up to match the children.

 Created a PR, let me know if there's a better way to do it.  I'll see
 about testing performance against some actual data sets.

 On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Ok, so looking at the optimizer code for the first time and trying the
 simplest rule that could possibly work,

 object UnionPushdown extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     // Push down filter into union
     case f @ Filter(condition, u @ Union(left, right)) =>
       u.copy(left = f.copy(child = left), right = f.copy(child = right))

     // Push down projection into union
     case p @ Project(projectList, u @ Union(left, right)) =>
       u.copy(left = p.copy(child = left), right = p.copy(child = right))
   }
 }


 If I try manually applying that rule to a logical plan in the repl, it
 produces the query shape I'd expect, and executing that plan results in
 parquet pushdowns as I'd expect.

 But adding those cases to ColumnPruning results in a runtime exception
 (below)

 I can keep digging, but it seems like I'm missing some obvious initial
 context around naming of attributes.  If you can provide any pointers to
 speed me on my way I'd appreciate it.


 java.lang.AssertionError: assertion failed: ArrayBuffer() +
 ArrayBuffer() != WrappedArray(name#6, age#7), List(name#9, age#10,
 phones#11)
 at scala.Predef$.assert(Predef.scala:179)
 at
 org.apache.spark.sql.parquet.ParquetTableScan.init(ParquetTableOperations.scala:75)
 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234)
 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234)
 at
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:367)
 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:230)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
 at
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282)
 at
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:282)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400

Re: problem with HiveContext inside Actor

2014-09-17 Thread Michael Armbrust
- dev

Is it possible that you are constructing more than one HiveContext in a
single JVM?  Due to global state in Hive code this is not allowed.
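
If so, the usual way around it is to build a single HiveContext up front and
hand that same instance to every actor, rather than constructing one inside
each actor.  Roughly (just a sketch -- all class, message and app names here
are made up):

  import akka.actor.{Actor, ActorSystem, Props}
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.hive.HiveContext

  case class RunQuery(query: String)

  // The actor only receives the shared context; it never creates its own.
  class DbActor(hiveContext: HiveContext) extends Actor {
    def receive = {
      case RunQuery(query) => sender ! hiveContext.sql(query).collect()
    }
  }

  object SharedContextExample extends App {
    val sc = new SparkContext("local[2]", "shared-hive-context")
    val hiveContext = new HiveContext(sc)   // created once, on the driver
    val system = ActorSystem("example")
    val db = system.actorOf(Props(new DbActor(hiveContext)), "db")
    db ! RunQuery("CREATE DATABASE IF NOT EXISTS test_db")
  }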

Michael

On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hi, Du

 I am not sure what you mean “triggers the HiveContext to create a
 database”, do you create the sub class of HiveContext? Just be sure you
 call the “HiveContext.sessionState” eagerly, since it will set the proper
 “hiveconf” into the SessionState, otherwise the HiveDriver will always get
 the null value when retrieving HiveConf.



 Cheng Hao



 *From:* Du Li [mailto:l...@yahoo-inc.com.INVALID]
 *Sent:* Thursday, September 18, 2014 7:51 AM
 *To:* u...@spark.apache.org; dev@spark.apache.org
 *Subject:* problem with HiveContext inside Actor



 Hi,



 Wonder anybody had similar experience or any suggestion here.



 I have an akka Actor that processes database requests in high-level
 messages. Inside this Actor, it creates a HiveContext object that does the
 actual db work. The main thread creates the needed SparkContext and passes
 in to the Actor to create the HiveContext.



 When a message is sent to the Actor, it is processed properly except that,
 when the message triggers the HiveContext to create a database, it throws a
 NullPointerException in hive.ql.Driver.java which suggests that its conf
 variable is not initialized.



 Ironically, it works fine if my main thread directly calls
 actor.hiveContext to create the database. The spark version is 1.1.0.



 Thanks,

 Du



Re: Support for Hive buckets

2014-09-22 Thread Michael Armbrust
Hi Cody,

There are currently no concrete plans for adding buckets to Spark SQL, but
thats mostly due to lack of resources / demand for this feature.  Adding
full support is probably a fair amount of work since you'd have to make
changes throughout parsing/optimization/execution.  That said, there are
probably some smaller tasks that could be easier (for example, you might be
able to avoid a shuffle when doing joins on tables that are already
bucketed by exposing more metastore information to the planner).

Michael

On Sun, Sep 14, 2014 at 3:10 PM, Cody Koeninger c...@koeninger.org wrote:

 I noticed that the release notes for 1.1.0 said that spark doesn't support
 Hive buckets yet.  I didn't notice any jira issues related to adding
 support.

 Broadly speaking, what would be involved in supporting buckets, especially
 the bucketmapjoin and sortedmerge optimizations?



Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Michael Armbrust
I actually submitted a patch to do this yesterday:
https://github.com/apache/spark/pull/2493

Can you tell us more about your configuration?  In particular, how much
memory/cores do the executors have and what does the schema of your data
look like?

On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger c...@koeninger.org wrote:

 So as a related question, is there any reason the settings in SQLConf
 aren't read from the spark context's conf?  I understand why the sql conf
 is mutable, but it's not particularly user friendly to have most spark
 configuration set via e.g. defaults.conf or --properties-file, but for
 spark sql to ignore those.

 On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger c...@koeninger.org
 wrote:

  After commit 8856c3d8 switched from gzip to snappy as default parquet
  compression codec, I'm seeing the following when trying to read parquet
  files saved using the new default (same schema and roughly same size as
  files that were previously working):
 
  java.lang.OutOfMemoryError: Direct buffer memory
  java.nio.Bits.reserveMemory(Bits.java:658)
  java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123)
  java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
 
 
 parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
 
 
 parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
  java.io.DataInputStream.readFully(DataInputStream.java:195)
  java.io.DataInputStream.readFully(DataInputStream.java:169)
 
 
 parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
 
  parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
 
  parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
 
  parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
 
  parquet.column.impl.ColumnReaderImpl.init(ColumnReaderImpl.java:339)
 
 
 parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
 
 
 parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
 
 
 parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:265)
 
  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
 
  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
 
 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
 
 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
 
 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
 
  org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
  scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
 
 
 org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
 
 
 org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
  org.apache.spark.scheduler.Task.run(Task.scala:54)
 
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
 
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:722)
 
 
 



Re: view not supported in spark thrift server?

2014-09-28 Thread Michael Armbrust
Views are not supported yet.  It's not currently on the near-term roadmap,
but that can change if there is sufficient demand or someone in the
community is interested in implementing them.  I do not think it would be
very hard.

Michael

On Sun, Sep 28, 2014 at 11:59 AM, Du Li l...@yahoo-inc.com.invalid wrote:


  Can anybody confirm whether or not view is currently supported in spark?
 I found “create view translate” in the blacklist of
 HiveCompatibilitySuite.scala and also the following scenario threw
 NullPointerException on beeline/thriftserver (1.1.0). Any plan to support
 it soon?

   create table src(k string, v string);

  load data local inpath
 '/home/y/share/yspark/examples/src/main/resources/kv1.txt' into table src;

  create view kv as select k, v from src;

  select * from kv;

 Error: java.lang.NullPointerException (state=,code=0)



Re: Extending Scala style checks

2014-10-01 Thread Michael Armbrust
The hard part here is updating the existing code base... which is going to
create merge conflicts with like all of the open PRs...
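
(For what it's worth, enabling the built-in check itself should just be a
one-line addition to scalastyle-config.xml, something along the lines of the
entry below -- the painful part is cleaning up the existing violations without
invalidating everyone's open patches.)

  <check level="error" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="true"/>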

On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Ah, since there appears to be a built-in rule for end-of-line whitespace,
 Michael and Cheng, y'all should be able to add this in pretty easily.

 Nick

 On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Nick,
 
  We can always take built-in rules. Back when we added this Prashant
  Sharma actually did some great work that lets us write our own style
  rules in cases where rules don't exist.
 
  You can see some existing rules here:
 
 
 https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle
 
  Prashant has over time contributed a lot of our custom rules upstream
  to stalastyle, so now there are only a couple there.
 
  - Patrick
 
  On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:
   Please take a look at WhitespaceEndOfLineChecker under:
   http://www.scalastyle.org/rules-0.1.0.html
  
   Cheers
  
   On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com
   wrote:
  
   As discussed here https://github.com/apache/spark/pull/2619, it
  would be
   good to extend our Scala style checks to programmatically enforce as
  many
   of our style rules as possible.
  
   Does anyone know if it's relatively straightforward to enforce
  additional
   rules like the no trailing spaces rule mentioned in the linked PR?
  
   Nick
  
 



Re: Parquet schema migrations

2014-10-05 Thread Michael Armbrust
Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing
column names are never reused with incompatible types), this is something
I'd love to support.  Perhaps you can describe more what sorts of changes
you are making, and if simple merging of the schemas would be sufficient.
If so, we can open a JIRA, though I'm not sure when we'll have resources to
dedicate to this.

In the near term, I'd suggest writing converters for each version of the
schema, that translate to some desired master schema.  You can then union
all of these together and avoid the cost of batch conversion.  It seems
like in most cases this should be pretty efficient, at least now that we
have good pushdown past union operators :)
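
Concretely, that could look something like this (a sketch -- the schemas,
paths and field positions are made up):

  // Target "master" schema that every on-disk version is converted into.
  case class MasterRecord(a: Int, b: Option[String], c: Option[Long])

  import sqlContext.createSchemaRDD  // lets an RDD of case classes act as a table

  // One small converter per schema version.
  val v1 = sqlContext.parquetFile("/data/v1")
    .map(r => MasterRecord(r.getInt(0), Some(r.getString(1)), None))
  val v2 = sqlContext.parquetFile("/data/v2")
    .map(r => MasterRecord(r.getInt(0), None, Some(r.getLong(1))))

  // Union them and query the combined view; filters and projections still
  // get pushed down past the union.
  (v1 ++ v2).registerTempTable("unified")
  sqlContext.sql("SELECT a, c FROM unified WHERE a > 10")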

Michael

On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Cody,

 I wasn't aware there were different versions of the parquet format.  What's
 the difference between raw parquet and the Hive-written parquet files?

 As for your migration question, the approaches I've often seen are
 convert-on-read and convert-all-at-once.  Apache Cassandra for example does
 both -- when upgrading between Cassandra versions that change the on-disk
 sstable format, it will do a convert-on-read as you access the sstables, or
 you can run the upgradesstables command to convert them all at once
 post-upgrade.

 Andrew

 On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger c...@koeninger.org wrote:

  Wondering if anyone has thoughts on a path forward for parquet schema
  migrations, especially for people (like us) that are using raw parquet
  files rather than Hive.
 
  So far we've gotten away with reading old files, converting, and writing
 to
  new directories, but that obviously becomes problematic above a certain
  data size.
 



Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Michael Armbrust
Thanks for the input.  We purposefully made sure that the config option did
not make it into a release as it is not something that we are willing to
support long term.  That said we'll try and make this easier in the future
either through hints or better support for statistics.

In this particular case you can get what you want by registering the tables
as external tables and setting a flag.  Here's a helper function to do
what you need.

/**
 * Sugar for creating a Hive external table from a parquet path.
 */
def createParquetTable(name: String, file: String): Unit = {
  import org.apache.spark.sql.hive.HiveMetastoreTypes

  val rdd = parquetFile(file)
  val schema = rdd.schema.fields.map(f =>
    s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
  val ddl = s"""
    |CREATE EXTERNAL TABLE $name (
    |  $schema
    |)
    |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    |LOCATION '$file'""".stripMargin
  sql(ddl)
  setConf("spark.sql.hive.convertMetastoreParquet", "true")
}

You'll also need to run this to populate the statistics:

ANALYZE TABLE  tableName COMPUTE STATISTICS noscan;
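
Putting it together for the fact/dimension case (a sketch -- the table names
and join query are made up, and it assumes the helper above is defined on, or
imported into, your HiveContext):

  createParquetTable("dim_users", "/data/dim_users")
  sql("ANALYZE TABLE dim_users COMPUTE STATISTICS noscan")

  // With statistics in place, tables smaller than
  // spark.sql.autoBroadcastJoinThreshold should be broadcast automatically
  // when joined against the large fact table.
  sql("""
    SELECT f.txn_id, d.country
    FROM fact_transactions f JOIN dim_users d ON f.user_id = d.user_id
  """)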


On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Ok, currently there's cost-based optimization however Parquet statistics
 is not implemented...

 What's the good way if I want to join a big fact table with several tiny
 dimension tables in Spark SQL (1.1)?

 I wish we can allow user hint for the join.

 Jianshi

 On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
 merged into master?

 I cannot find spark.sql.hints.broadcastTables in latest master, but it's
 in the following patch.


 https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5


 Jianshi


 On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Yes, looks like it can only be controlled by the
 parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
 to me.

 How am I suppose to know the exact bytes of a table? Let me specify the
 join algorithm is preferred I think.

 Jianshi

 On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu yuzhih...@gmail.com wrote:

 Have you looked at SPARK-1800 ?

 e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
 Cheers

 On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:

 I cannot find it in the documentation. And I have a dozen dimension
 tables to (left) join...


 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread Michael Armbrust
Yes, the foreign sources work is only about exposing a stable set of APIs
for external libraries to link against (to avoid the spark assembly
becoming a dependency mess).  The code path these APIs use will be the same
as that for datasources included in the core spark sql library.

Michael

On Thu, Oct 9, 2014 at 2:18 PM, James Yu jym2...@gmail.com wrote:

 For performance, will foreign data format support, same as native ones?

 Thanks,
 James


 On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com wrote:

  The foreign data source API PR also matters here
  https://www.github.com/apache/spark/pull/2475
 
  Foreign data source like ORC can be added more easily and systematically
  after this PR is merged.
 
  On 10/9/14 8:22 AM, James Yu wrote:
 
  Thanks Mark! I will keep eye on it.
 
  @Evan, I saw people use both format, so I really want to have Spark
  support
  ORCFile.
 
 
  On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra m...@clearstorydata.com
  wrote:
 
   https://github.com/apache/spark/pull/2576
 
 
 
  On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan velvia.git...@gmail.com
  wrote:
 
   James,
 
  Michael at the meetup last night said there was some development
  activity around ORCFiles.
 
  I'm curious though, what are the pros and cons of ORCFiles vs Parquet?
 
  On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote:
 
  Didn't see anyone asked the question before, but I was wondering if
 
  anyone
 
  knows if Spark/SparkSQL will support ORCFile format soon? ORCFile is
  getting more and more popular in the Hive world.
 
  Thanks,
  James
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
 



Re: Trouble running tests

2014-10-09 Thread Michael Armbrust
Also, in general for SQL-only changes it is sufficient to run sbt/sbt
catalyst/test sql/test hive/test.  The hive/test part takes the
longest, so I usually leave that out until just before submitting unless my
changes are hive specific.

On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 _RUN_SQL_TESTS needs to be true as well. Those two _... variables get set
 correctly when tests are run on Jenkins. They’re not meant to be
 manipulated directly by testers.

 Did you want to run SQL tests only locally? You can try faking being
 Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
 should be simpler than futzing with the _... variables.

 Nick
 ​

 On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote:

  Hi, apologies if I missed a FAQ somewhere.
 
  I am trying to submit a bug fix for the very first time. Reading
  instructions, I forked the git repo (at
  c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.
 
  I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true
 
  and after a while get the following error:
 
  [info] ScalaTest
  [info] Run completed in 3 minutes, 37 seconds.
  [info] Total number of tests run: 224
  [info] Suites: completed 19, aborted 0
  [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
  [info] All tests passed.
  [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
  [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
  [error] Expected ID character
  [error] Not a valid command: hive-thriftserver
  [error] Expected project ID
  [error] Expected configuration
  [error] Expected ':' (if selecting a configuration)
  [error] Expected key
  [error] Not a valid key: hive-thriftserver
  [error] hive-thriftserver/test
  [error]  ^
 
 
  (I am running this without my changes)
 
  I have 2 questions:
  1. How to fix this
  2. Is there a best practice on what to fork so you start off with a good
  state? I'm wondering if I should sync the latest changes or go back to a
  label?
 
  thanks in advance
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Parquet Migrations

2014-10-31 Thread Michael Armbrust
You can't change parquet schema without reencoding the data as you need to
recalculate the footer index data.  You can manually do what SPARK-3851
https://issues.apache.org/jira/browse/SPARK-3851 is going to do today
however.

Consider two schemas:

Old Schema: (a: Int, b: String)
New Schema, where I've dropped and added a column: (a: Int, c: Long)

parquetFile("old").registerTempTable("old")
parquetFile("new").registerTempTable("new")

sql("""
  SELECT a, b, CAST(null AS LONG) AS c FROM old UNION ALL
  SELECT a, CAST(null AS STRING) AS b, c FROM new
""").registerTempTable("unifiedData")

Because of filter/column pushdown past UNIONs this should execute as
desired even if you write more complicated queries on top of
"unifiedData".  It's a little onerous but should work for now.  This can
also support things like column renaming, which would be much harder to do
automatically.

On Fri, Oct 31, 2014 at 1:49 PM, Gary Malouf malouf.g...@gmail.com wrote:

 Outside of what is discussed here
 https://issues.apache.org/jira/browse/SPARK-3851 as a future solution,
 is
 there any path for being able to modify a Parquet schema once some data has
 been written?  This seems like the kind of thing that should make people
 pause when considering whether or not to use Parquet+Spark...



Re: Surprising Spark SQL benchmark

2014-11-04 Thread Michael Armbrust
dev to bcc.

Thanks for reaching out, Ozgun.  Let's discuss if there were any missing
optimizations off list.  We'll make sure to report back or add any findings
to the tuning guide.

On Mon, Nov 3, 2014 at 3:01 PM, ozgun oz...@citusdata.com wrote:

 Hey Patrick,

 It's Ozgun from Citus Data. We'd like to make these benchmark results fair,
 and have tried different config settings for SparkSQL over the past month.
 We picked the best config settings we could find, and also contacted the
 Spark users list about running TPC-H numbers.

 http://goo.gl/IU5Hw0
 http://goo.gl/WQ1kML
 http://goo.gl/ihLzgh

 We also received advice at the Spark Summit '14 to wait until v1.1, and
 therefore re-ran our tests on SparkSQL 1.1. On the specific optimizations,
 Marco and Samay from our team have much more context, and I'll let them
 answer your questions on the different settings we tried.

 Our intent is to be fair and not misrepresent SparkSQL's performance. On
 that front, we used publicly available documentation and user lists, and
 spent about a month trying to get the best Spark performance results. If
 there are specific optimizations we should have applied and missed, we'd
 love to be involved with the community in re-running the numbers.

 Is this email thread the best place to continue the conversation?

 Best,
 Ozgun



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Surprising-Spark-SQL-benchmark-tp9041p9073.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Michael Armbrust
+1 (binding)

On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 BTW, my own vote is obviously +1 (binding).

 Matei

  On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
  Hi all,
 
  I wanted to share a discussion we've been having on the PMC list, as
 well as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed implementing a
 maintainer model for some of these components, similar to other large
 projects.
 
  As background on this, Spark has grown a lot since joining Apache. We've
 had over 80 contributors/month for the past 3 months, which I believe makes
 us the most active project in contributors/month at Apache, as well as over
 500 patches/month. The codebase has also grown significantly, with new
 libraries for SQL, ML, graphs and more.
 
  In this kind of large project, one common way to scale development is to
 assign maintainers to oversee key components, where each patch to that
 component needs to get sign-off from at least one of its maintainers. Most
 existing large projects do this -- at Apache, some large ones with this
 model are CloudStack (the second-most active project overall), Subversion,
 and Kafka, and other examples include Linux and Python. This is also
 by-and-large how Spark operates today -- most components have a de-facto
 maintainer.
 
  IMO, adopting this model would have two benefits:
 
  1) Consistent oversight of design for that component, especially
 regarding architecture and API. This process would ensure that the
 component's maintainers see all proposed changes and consider them to fit
 together in a good way.
 
  2) More structure for new contributors and committers -- in particular,
 it would be easy to look up who’s responsible for each module and ask them
 for reviews, etc, rather than having patches slip between the cracks.
 
  We'd like to start with in a light-weight manner, where the model only
 applies to certain key components (e.g. scheduler, shuffle) and user-facing
 APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
 it if we deem it useful. The specific mechanics would be as follows:
 
  - Some components in Spark will have maintainers assigned to them, where
 one of the maintainers needs to sign off on each patch to the component.
  - Each component with maintainers will have at least 2 maintainers.
  - Maintainers will be assigned from the most active and knowledgeable
 committers on that component by the PMC. The PMC can vote to add / remove
 maintainers, and maintained components, through consensus.
  - Maintainers are expected to be active in responding to patches for
 their components, though they do not need to be the main reviewers for them
 (e.g. they might just sign off on architecture / API). To prevent inactive
 maintainers from blocking the project, if a maintainer isn't responding in
 a reasonable time period (say 2 weeks), other committers can merge the
 patch, and the PMC will want to discuss adding another maintainer.
 
  If you'd like to see examples for this model, check out the following
 projects:
  - CloudStack:
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
 
  - Subversion:
 https://subversion.apache.org/docs/community-guide/roles.html
 
  Finally, I wanted to list our current proposal for initial components
 and maintainers. It would be good to get feedback on other components we
 might add, but please note that personnel discussions (e.g. I don't think
 Matei should maintain *that* component) should only happen on the private
 list. The initial components were chosen to include all public APIs and the
 main core components, and the maintainers were chosen from the most active
 contributors to those modules.
 
  - Spark core public API: Matei, Patrick, Reynold
  - Job scheduler: Matei, Kay, Patrick
  - Shuffle and network: Reynold, Aaron, Matei
  - Block manager: Reynold, Aaron
  - YARN: Tom, Andrew Or
  - Python: Josh, Matei
  - MLlib: Xiangrui, Matei
  - SQL: Michael, Reynold
  - Streaming: TD, Matei
  - GraphX: Ankur, Joey, Reynold
 
  I'd like to formally call a [VOTE] on this model, to last 72 hours. The
 [VOTE] will end on Nov 8, 2014 at 6 PM PST.
 
  Matei




Re: Replacing Spark's native scheduler with Sparrow

2014-11-08 Thread Michael Armbrust

 However, I haven't seen it be as
 high as the 100ms Michael quoted (maybe this was for jobs with tasks that
 have much larger objects that take a long time to deserialize?).


I was thinking more about the average end-to-end latency for launching a
query that has 100s of partitions. It's also quite possible that SQL's task
launch overhead is higher since we have never profiled how much is getting
pulled into the closures.


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Michael Armbrust
Hey Sean,

Thanks for pointing this out.  Looks like a bad test where we should be
doing Set comparison instead of Array.

Michael

On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen so...@cloudera.com wrote:

 LICENSE and NOTICE are fine. Signature and checksum is fine. I
 unzipped and built the plain source distribution, which built.

 However I am seeing a consistent test failure with mvn -DskipTests
 clean package; mvn test. In the Hive module:

 - SET commands semantics for a HiveContext *** FAILED ***
   Expected Array(spark.sql.key.usedfortestonly=test.val.0,

 spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0),
 but got
 Array(spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0,
 spark.sql.key.usedfortestonly=test.val.0) (HiveQuerySuite.scala:544)

 Anyone else seeing this?


 On Thu, Nov 13, 2014 at 8:18 AM, Krishna Sankar ksanka...@gmail.com
 wrote:
  +1
  1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
  -Dhadoop.version=2.4.0 -DskipTests clean package 10:49 min
  2. Tested pyspark, mlib
  2.1. statistics OK
  2.2. Linear/Ridge/Laso Regression OK
  2.3. Decision Tree, Naive Bayes OK
  2.4. KMeans OK
  2.5. rdd operations OK
  2.6. recommendation OK
  2.7. Good work ! In 1.1.0, there was an error and my program used to hang
  (over memory allocation) consistently running validation using itertools,
  compute optimum rank, lambda,numofiterations/rmse; data - movielens
 medium
  dataset (1 million records) . It works well in 1.1.1 !
  Cheers
  k/
  P.S: Missed Reply all, first time
 
  On Wed, Nov 12, 2014 at 8:35 PM, Andrew Or and...@databricks.com
 wrote:
 
  I will start the vote with a +1
 
  2014-11-12 20:34 GMT-08:00 Andrew Or and...@databricks.com:
 
   Please vote on releasing the following candidate as Apache Spark
 version
  1
   .1.1.
  
   This release fixes a number of bugs in Spark 1.1.0. Some of the
 notable
   ones are
   - [SPARK-3426] Sort-based shuffle compression settings are
 incompatible
   - [SPARK-3948] Stream corruption issues in sort-based shuffle
   - [SPARK-4107] Incorrect handling of Channel.read() led to data
  truncation
   The full list is at http://s.apache.org/z9h and in the CHANGES.txt
   attached.
  
   The tag to be voted on is v1.1.1-rc1 (commit 72a4fdbe):
   http://s.apache.org/cZC
  
   The release files, including signatures, digests, etc can be found at:
   http://people.apache.org/~andrewor14/spark-1.1.1-rc1/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/andrewor14.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1034/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~andrewor14/spark-1.1.1-rc1-docs/
  
   Please vote on releasing this package as Apache Spark 1.1.1!
  
   The vote is open until Sunday, November 16, at 04:30 UTC and passes if
   a majority of at least 3 +1 PMC votes are cast.
   [ ] +1 Release this package as Apache Spark 1.1.1
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   Cheers,
   Andrew
  
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Michael Armbrust
I'm going to have to disagree here.  If you are building a release
distribution or integrating with legacy systems then maven is probably the
correct choice.  However, most of the core developers that I know use sbt,
and I think it's a better choice for exploration and development overall.
That said, this probably falls into the category of a religious argument, so
you might want to look at both options and decide for yourself.

In my experience the SBT build is significantly faster with less effort
(and I think sbt is still faster even if you go through the extra effort of
installing zinc) and easier to read.  The console mode of sbt (just run
sbt/sbt and a long-running console session starts that will accept
further commands) is great for building individual subprojects or running
single test suites.  In addition to being faster since it's a long-running
JVM, it's got a lot of nice features like tab-completion for test case names.

For example, if I wanted to see what test cases are available in the SQL
subproject you can do the following:

[marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt
[info] Loading project definition from
/Users/marmbrus/workspace/spark/project/project
[info] Loading project definition from
/Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[info] Set current project to spark-parent (in build
file:/Users/marmbrus/workspace/spark/)
> sql/test-only *tab*
--
 org.apache.spark.sql.CachedTableSuite
org.apache.spark.sql.DataTypeSuite
 org.apache.spark.sql.DslQuerySuite
org.apache.spark.sql.InsertIntoSuite
...

Another very useful feature is the development console, which starts an
interactive REPL including the most recent version of the code and a lot of
useful imports for some subprojects.  For example in the hive subproject it
automatically sets up a temporary database with a bunch of test data
pre-loaded:

$ sbt/sbt hive/console
 hive/console
...
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.parquet.ParquetTestData
Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_45).
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("SELECT * FROM src").take(2)
res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86])

Michael

On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody 
dineshjweerakk...@gmail.com wrote:

 Hi Stephen and Sean,

 Thanks for correction.

 On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote:

  No, the Maven build is the main one.  I would use it unless you have a
  need to use the SBT build in particular.
  On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody 
  dineshjweerakk...@gmail.com wrote:
 
  Hi Yiming,
 
  I believe that both SBT and MVN are supported in Spark, but SBT is
  preferred
  (I'm not 100% sure about this :) ). When I was using MVN I got some build
  failures. After that I used SBT and it works fine.
 
  You can go through these discussions regarding SBT vs MVN and learn pros
  and cons of both [1] [2].
 
  [1]
 
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html
 
  [2]
 
 
 https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ
 
  Thanks,
 
  On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang sdi...@gmail.com
  wrote:
 
   Hi,
  
  
  
    I am new to developing Spark and my current focus is on co-scheduling of
    Spark tasks. However, I am confused by the build tools: sometimes the
    documentation uses mvn and sometimes sbt.
  
  
  
    So, my question is: which one is the preferred tool of the Spark
    community? And what's the technical difference between them? Thank you!
  
  
  
   Cheers,
  
   Yiming
  
  
 
 
  --
  Thanks  Best Regards,
 
  *Dinesh J. Weerakkody*
 
 


 --
 Thanks  Best Regards,

 *Dinesh J. Weerakkody*



Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Michael Armbrust

 * I moved from sbt to maven in June specifically due to Andrew Or's
 describing mvn as the default build tool.  Developers should keep in mind
 that jenkins uses mvn so we need to run mvn before submitting PR's - even
 if sbt were used for day to day dev work


To be clear, I think that the PR builder actually uses sbt
https://github.com/apache/spark/blob/master/dev/run-tests#L198 currently,
but there are master builds that make sure maven doesn't break (amongst
other things).


 *  In addition, as Sean has alluded to, the Intellij seems to comprehend
 the maven builds a bit more readily than sbt


Yeah, this is a very good point.  I have used `sbt/sbt gen-idea` in the
past, but I'm currently using the maven integration of IntelliJ since it
seems more stable.


 * But for command line and day to day dev purposes:  sbt sounds great to
 use  Those sound bites you provided about exposing built-in test databases
 for hive and for displaying available testcases are sweet.  Any
 easy/convenient way to see more of  those kinds of facilities available
 through sbt ?


The Spark SQL developer readme
https://github.com/apache/spark/tree/master/sql has a little bit of this,
but we really should have some documentation on using SBT as well.

 Integrating with those systems is generally easier if you are also working
 with Spark in Maven.  (And I wouldn't classify all of those Maven-built
 systems as legacy, Michael :)


Also a good point, though I've seen some pretty clever uses of sbt's
external project references to link spark into other projects.  I'll
certainly admit I have a bias towards new shiny things in general though,
so my definition of legacy is probably skewed :)


Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Michael Armbrust
No, it should support any data source that has a schema and can produce
rows.
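
For illustration, here is a minimal sketch of the data sources API mentioned
in the quoted reply below. The format, option key, schema and class names are
all made up, and the exact signatures should be checked against the linked
test cases and the spark-avro example (this is written from memory against the
1.2-era org.apache.spark.sql.sources package):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources.{RelationProvider, TableScan}

// Looked up via: CREATE TEMPORARY TABLE t USING my.format OPTIONS (path '...')
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]) =
    MyRelation(parameters("path"))(sqlContext)
}

case class MyRelation(path: String)(@transient val sqlContext: SQLContext)
  extends TableScan {

  // Any schema you can describe...
  override def schema = StructType(Seq(
    StructField("key", IntegerType, nullable = false),
    StructField("value", StringType, nullable = true)))

  // ...and any way of producing Rows that match it.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 10).map(i => Row(i, s"value_$i"))
}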

On Mon, Dec 1, 2014 at 1:34 AM, Niranda Perera nira...@wso2.com wrote:

 Hi Michael,

 About this new data source API, what type of data sources would it
 support? Does it have to be RDBMS necessarily?

 Cheers

 On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust mich...@databricks.com
  wrote:

 You probably don't need to create a new kind of SchemaRDD.  Instead I'd
 suggest taking a look at the data sources API that we are adding in Spark
 1.2.  There is not a ton of documentation, but the test cases show how
 to implement the various interfaces
 https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/sources,
 and there is an example library for reading Avro data
 https://github.com/databricks/spark-avro.

 On Thu, Nov 27, 2014 at 10:31 PM, Niranda Perera nira...@wso2.com
 wrote:

 Hi,

 I am evaluating Spark for an analytic component where we do batch
 processing of data using SQL.

 So, I am particularly interested in Spark SQL and in creating a SchemaRDD
 from an existing API [1].

 This API exposes elements in a database as datasources. Using the methods
 allowed by this data source, we can access and edit data.

 So, I want to create a custom SchemaRDD using the methods and provisions
 of
 this API. I tried going through Spark documentation and the Java Docs,
 but
 unfortunately, I was unable to come to a final conclusion if this was
 actually possible.

 I would like to ask the Spark Devs,
 1. As of the current Spark release, can we make a custom SchemaRDD?
 2. What is the extension point to a custom SchemaRDD? or are there
 particular interfaces?
 3. Could you please point me the specific docs regarding this matter?

 Your help in this regard is highly appreciated.

 Cheers

 [1]

 https://github.com/wso2-dev/carbon-analytics/tree/master/components/xanalytics

 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44





 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44



Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe

2014-12-02 Thread Michael Armbrust
In Hive 13 (which is the default for Spark 1.2), Parquet support is included
and thus we no longer include the Hive Parquet bundle. You can now use the
included SerDe: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

If you want to compile Spark 1.2 with Hive 12 instead, you can pass
-Phive-0.12.0 and parquet.hive.serde.ParquetHiveSerDe will be included as
before.
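
For example (an illustrative sketch only — the table name, columns and
location below are made up), a Hive 13 style DDL issued through
HiveContext.sql can reference the bundled SerDe directly:

sql("""
  CREATE EXTERNAL TABLE logs_parquet (id BIGINT, msg STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  LOCATION '/path/to/parquet'
""")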

Michael

On Tue, Dec 2, 2014 at 9:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Apologies if people get this more than once -- I sent mail to dev@spark
 last night and don't see it in the archives. Trying the incubator list
 now...wanted to make sure it doesn't get lost in case it's a bug...

 -- Forwarded message --
 From: Yana Kadiyska yana.kadiy...@gmail.com
 Date: Mon, Dec 1, 2014 at 8:10 PM
 Subject: [Thrift,1.2 RC] what happened to
 parquet.hive.serde.ParquetHiveSerDe
 To: dev@spark.apache.org


 Hi all, apologies if this is not a question for the dev list -- figured
 User list might not be appropriate since I'm having trouble with the RC
 tag.

 I just tried deploying the RC and running ThriftServer. I see the following
 error:

 14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException
 as:anonymous (auth:SIMPLE)
 cause:org.apache.hive.service.cli.HiveSQLException:
 java.lang.RuntimeException:
 MetaException(message:java.lang.ClassNotFoundException Class
 parquet.hive.serde.ParquetHiveSerDe not found)
 14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException:
 MetaException(message:java.lang.ClassNotFoundException Class
 parquet.hive.serde.ParquetHiveSerDe not found)
 at

 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at

 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at

 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 ​


 I looked at a working installation that I have(build master a few weeks
 ago) and this class used to be included in spark-assembly:

 ls *.jar|xargs grep parquet.hive.serde.ParquetHiveSerDe
 Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar
 matches

 but with the RC build it's not there?

 I tried both the prebuilt CDH drop and later manually built the tag with
 the following command:

  ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0
 -Phive-thriftserver
 $JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar
 |grep parquet.hive.serde.ParquetHiveSerDe

 comes back empty...



Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe

2014-12-04 Thread Michael Armbrust
Here's a fix: https://github.com/apache/spark/pull/3586

On Wed, Dec 3, 2014 at 11:05 AM, Michael Armbrust mich...@databricks.com
wrote:

 Thanks for reporting. As a workaround you should be able to SET
 spark.sql.hive.convertMetastoreParquet=false, but I'm going to try to fix
 this before the next RC.

 On Wed, Dec 3, 2014 at 7:09 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Thanks Michael, you are correct.

 I also opened https://issues.apache.org/jira/browse/SPARK-4702 -- if
 someone can comment on why this might be happening that would be great.
 This would be a blocker to me using 1.2 and it used to work so I'm a bit
 puzzled. I was hoping that it's again a result of the default profile
 switch but it didn't seem to be the case

 (ps. please advise if this is more user-list appropriate. I'm posting to
 dev as it's an RC)

 On Tue, Dec 2, 2014 at 8:37 PM, Michael Armbrust mich...@databricks.com
 wrote:

 In Hive 13 (which is the default for Spark 1.2), parquet is included and
 thus we no longer include the Hive parquet bundle. You can now use the
 included
 ParquetSerDe: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

 If you want to compile Spark 1.2 with Hive 12 instead you can pass
 -Phive-0.12.0 and  parquet.hive.serde.ParquetHiveSerDe will be included as
 before.

 Michael

 On Tue, Dec 2, 2014 at 9:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Apologies if people get this more than once -- I sent mail to dev@spark
 last night and don't see it in the archives. Trying the incubator list
 now...wanted to make sure it doesn't get lost in case it's a bug...

 -- Forwarded message --
 From: Yana Kadiyska yana.kadiy...@gmail.com
 Date: Mon, Dec 1, 2014 at 8:10 PM
 Subject: [Thrift,1.2 RC] what happened to
 parquet.hive.serde.ParquetHiveSerDe
 To: dev@spark.apache.org


 Hi all, apologies if this is not a question for the dev list -- figured
 User list might not be appropriate since I'm having trouble with the RC
 tag.

 I just tried deploying the RC and running ThriftServer. I see the
 following
 error:

 14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException
 as:anonymous (auth:SIMPLE)
 cause:org.apache.hive.service.cli.HiveSQLException:
 java.lang.RuntimeException:
 MetaException(message:java.lang.ClassNotFoundException Class
 parquet.hive.serde.ParquetHiveSerDe not found)
 14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException:
 java.lang.RuntimeException:
 MetaException(message:java.lang.ClassNotFoundException Class
 parquet.hive.serde.ParquetHiveSerDe not found)
 at

 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at

 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at

 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 ​


 I looked at a working installation that I have(build master a few weeks
 ago) and this class used to be included in spark-assembly:

 ls *.jar|xargs grep parquet.hive.serde.ParquetHiveSerDe
 Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar
 matches

 but with the RC build it's not there?

 I tried both the prebuilt CDH drop and later manually built the tag with
 the following command:

  ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0
 -Phive-thriftserver
 $JAVA_HOME/bin/jar -tvf
 spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar
 |grep parquet.hive.serde.ParquetHiveSerDe

 comes back empty...







Re: drop table if exists throws exception

2014-12-05 Thread Michael Armbrust
The command runs fine for me on master.  Note that Hive does print an
exception in the logs, but that exception does not propagate to user code.
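
A quick way to check this yourself (hypothetical snippet; hiveContext stands
in for an existing HiveContext and the table name is made up):

hiveContext.sql("DROP TABLE IF EXISTS some_table_that_does_not_exist")
// Hive logs a NoSuchObjectException internally, but the call returns normally.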

On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 I got exception saying Hive: NoSuchObjectException(message:table table
 not found)

 when running DROP TABLE IF EXISTS table

 Looks like a new regression in Hive module.

 Anyone can confirm this?

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-05 Thread Michael Armbrust
Thanks for reporting.  This looks like a regression related to:
https://github.com/apache/spark/pull/2570

I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769

On Fri, Dec 5, 2014 at 12:03 PM, kb kend...@hotmail.com wrote:

 I am having trouble getting create table as select or saveAsTable from a
 hiveContext to work with temp tables in spark 1.2.  No issues in 1.1.0 or
 1.1.1

 Simple modification to test case in the hive SQLQuerySuite.scala:

 test(double nested data) {
 sparkContext.parallelize(Nested1(Nested2(Nested3(1))) ::
 Nil).registerTempTable(nested)
 checkAnswer(
   sql(SELECT f1.f2.f3 FROM nested),
   1)
 checkAnswer(sql(CREATE TABLE test_ctas_1234 AS SELECT * from nested),
 Seq.empty[Row])
 checkAnswer(
   sql(SELECT * FROM test_ctas_1234),
   sql(SELECT * FROM nested).collect().toSeq)
   }


 output:

 11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer:
 org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not
 found
 'nested'
 at

 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)
 at

 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192)
 at

 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209)
 at

 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:105)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103)
 at

 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122)
 at

 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
 at

 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
 at

 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 at org.scalatest.Transformer.apply(Transformer.scala:22)
 at org.scalatest.Transformer.apply(Transformer.scala:20)
 at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
 at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
 at

 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 at
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 at
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
 at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
 at

 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 at

 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 at

 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
 at

 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
 at
 org.scalatest.SuperEngine.org
 $scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
 at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
 at
 org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
 at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
 at org.scalatest.Suite$class.run(Suite.scala:1424)
 at
 org.scalatest.FunSuite.org
 $scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
 at
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
 at
 

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Michael Armbrust
This is by Hive's design.  From the Hive documentation:

The column change command will only modify Hive's metadata, and will not
 modify data. Users should make sure the actual data layout of the
 table/partition conforms with the metadata definition.



On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Ok, found another possible bug in Hive.

 My current solution is to use ALTER TABLE CHANGE to rename the column
 names.

 The problem is after renaming the column names, the value of the columns
 became all NULL.

 Before renaming:
  scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

 Execute renaming:
  scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive

 After renaming:
  scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])

 I created a JIRA for it:

   https://issues.apache.org/jira/browse/SPARK-4781


 Jianshi

 On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm... another issue I found doing this approach is that ANALYZE TABLE
 ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
 later broadcast join and such will fail...

 Any idea how to fix this issue?

 Jianshi

 On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Very interesting, the line doing drop table will throws an exception.
 After removing it all works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Here's the solution I got after talking with Liancheng:

 1) using backquote `..` to wrap up all illegal characters

  val rdd = parquetFile(file)
  val schema = rdd.schema.fields.map(f =>
    s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")

  val ddl_13 = s"""
    |CREATE EXTERNAL TABLE $name (
    |  $schema
    |)
    |STORED AS PARQUET
    |LOCATION '$file'
    """.stripMargin

  sql(ddl_13)

  2) create a new Schema and do applySchema to generate a new SchemaRDD,
  had to drop and register table

  val t = table(name)
  val newSchema = StructType(t.schema.fields.map(s =>
    s.copy(name = s.name.replaceAll(".*?::", ""))))
  sql(s"drop table $name")
  applySchema(t, newSchema).registerTempTable(name)

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 I had to use Pig for some preprocessing and to generate Parquet files
 for Spark to consume.

 However, due to Pig's limitation, the generated schema contains Pig's
 identifier

 e.g.
 sorted::id, sorted::cre_ts, ...

 I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.

   create external table pmt (
 sorted::id bigint
   )
   stored as parquet
   location '...'

 Obviously it didn't work, I also tried removing the identifier
 sorted::, but the resulting rows contain only nulls.

 Any idea how to create a table in HiveContext from these Parquet files?

 Thanks,
 Jianshi
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-08 Thread Michael Armbrust
This is merged now and should be fixed in the next 1.2 RC.

On Sat, Dec 6, 2014 at 8:28 PM, Cheng, Hao hao.ch...@intel.com wrote:

 I've created(reused) the PR https://github.com/apache/spark/pull/3336,
 hopefully we can fix this regression.

 Thanks for the reporting.

 Cheng Hao

 -Original Message-
 From: Michael Armbrust [mailto:mich...@databricks.com]
 Sent: Saturday, December 6, 2014 4:51 AM
 To: kb
 Cc: d...@spark.incubator.apache.org; Cheng Hao
 Subject: Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

 Thanks for reporting.  This looks like a regression related to:
 https://github.com/apache/spark/pull/2570

 I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769

 On Fri, Dec 5, 2014 at 12:03 PM, kb kend...@hotmail.com wrote:

  I am having trouble getting create table as select or saveAsTable
  from a hiveContext to work with temp tables in spark 1.2.  No issues
  in 1.1.0 or
  1.1.1
 
  Simple modification to test case in the hive SQLQuerySuite.scala:
 
  test(double nested data) {
  sparkContext.parallelize(Nested1(Nested2(Nested3(1))) ::
  Nil).registerTempTable(nested)
  checkAnswer(
sql(SELECT f1.f2.f3 FROM nested),
1)
  checkAnswer(sql(CREATE TABLE test_ctas_1234 AS SELECT * from
  nested),
  Seq.empty[Row])
  checkAnswer(
sql(SELECT * FROM test_ctas_1234),
sql(SELECT * FROM nested).collect().toSeq)
}
 
 
  output:
 
  11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer:
  org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not
  found 'nested'
  at
 
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)
  at
 
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192)
  at
 
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209)
  at
 
 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89)
  at
 
 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
  at
 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
  at
  org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
  at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:105)
  at
 org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103)
  at
 
 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122)
  at
 
 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
  at
 
 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
  at
 
 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at
 org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
  at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
  at
 
 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at
 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at
 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at
 org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
  at
 
 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at
 
 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at
 
 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at
 
 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:318

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-09 Thread Michael Armbrust
You might also try out the recently added support for views.
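
For example, a hypothetical sketch (table and column names follow the thread;
the view name is made up) that exposes cleaned-up names through a view instead
of ALTER TABLE ... CHANGE, leaving the underlying metadata untouched:

sql("""
  CREATE VIEW pmt_clean AS
  SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts
  FROM pmt
""")
sql("SELECT cre_ts FROM pmt_clean LIMIT 1").collect()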

On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Ah... I see. Thanks for pointing it out.

 Then it means we cannot mount external table using customized column
 names. hmm...

 Then the only option left is to use a subquery to add a bunch of column
 alias. I'll try it later.

 Thanks,
 Jianshi

 On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com
 wrote:

 This is by hive's design.  From the Hive documentation:

 The column change command will only modify Hive's metadata, and will not
 modify data. Users should make sure the actual data layout of the
 table/partition conforms with the metadata definition.



 On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Ok, found another possible bug in Hive.

 My current solution is to use ALTER TABLE CHANGE to rename the column
 names.

 The problem is after renaming the column names, the value of the columns
 became all NULL.

 Before renaming:
 scala sql(select `sorted::cre_ts` from pmt limit 1).collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

 Execute renaming:
 scala sql(alter table pmt change `sorted::cre_ts` cre_ts string)
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive

 After renaming:
 scala sql(select cre_ts from pmt limit 1).collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])

 I created a JIRA for it:

   https://issues.apache.org/jira/browse/SPARK-4781


 Jianshi

 On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm... another issue I found doing this approach is that ANALYZE TABLE
 ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
 later broadcast join and such will fail...

 Any idea how to fix this issue?

 Jianshi

 On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Very interesting, the line doing drop table will throws an exception.
 After removing it all works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:

 Here's the solution I got after talking with Liancheng:

 1) using backquote `..` to wrap up all illegal characters

 val rdd = parquetFile(file)
 val schema = rdd.schema.fields.map(f = s`${f.name}`
 ${HiveMetastoreTypes.toMetastoreType(f.dataType)}).mkString(,\n)

 val ddl_13 = s
   |CREATE EXTERNAL TABLE $name (
   |  $schema
   |)
   |STORED AS PARQUET
   |LOCATION '$file'
   .stripMargin

 sql(ddl_13)

 2) create a new Schema and do applySchema to generate a new
 SchemaRDD, had to drop and register table

 val t = table(name)
 val newSchema = StructType(t.schema.fields.map(s = s.copy(name =
 s.name.replaceAll(.*?::, 
 sql(sdrop table $name)
 applySchema(t, newSchema).registerTempTable(name)

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang 
 jianshi.hu...@gmail.com wrote:

 Hi,

 I had to use Pig for some preprocessing and to generate Parquet
 files for Spark to consume.

 However, due to Pig's limitation, the generated schema contains
 Pig's identifier

 e.g.
 sorted::id, sorted::cre_ts, ...

 I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.

   create external table pmt (
 sorted::id bigint
   )
   stored as parquet
   location '...'

 Obviously it didn't work, I also tried removing the identifier
 sorted::, but the resulting rows contain only nulls.

 Any idea how to create a table in HiveContext from these Parquet
 files?

 Thanks,
 Jianshi
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: SparkSQL not honoring schema

2014-12-10 Thread Michael Armbrust
As the Scala doc for applySchema says, "It is important to make sure that
the structure of every [[Row]] of the provided RDD matches the provided
schema. Otherwise, there will be runtime exceptions."  We don't check, as
doing runtime reflection on all of the data would be very expensive.  You
will only get errors if you try to manipulate the data; otherwise it
will pass it through.

I have written some debugging code (developer API, not guaranteed to be
stable), though, that you can use.

import org.apache.spark.sql.execution.debug._
schemaRDD.typeCheck()
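
A hypothetical usage sketch (names are made up; the 1.2-era applySchema API is
assumed): build a SchemaRDD whose second row violates the declared schema and
let typeCheck() flag it.

import org.apache.spark.sql._
import org.apache.spark.sql.execution.debug._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

val rows = sc.parallelize(Seq(
  Row(1, "ok"),
  Row("not-an-int", "malformed")))  // does not conform to IntegerType

val schemaRDD = sqlContext.applySchema(rows, schema)
schemaRDD.typeCheck()  // reports the mismatch instead of silently passing the data through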

On Wed, Dec 10, 2014 at 6:19 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Hello,

 I defined a SchemaRDD by applying a hand-crafted StructType to an RDD. Some
 of the Rows in the RDD are malformed--that is, they do not conform to the
 schema defined by the StructType. When running a select statement on this
 SchemaRDD I would expect SparkSQL to either reject the malformed rows or
 fail. Instead, it returns whatever data it finds, even if malformed. Is
 this the desired behavior? Is there no method in SparkSQL to check for
 validity with respect to the schema?

 Thanks.

 Alex



Re: Is there any document to explain how to build the hive jars for spark?

2014-12-14 Thread Michael Armbrust
The modified version of hive can be found here:
https://github.com/pwendell/hive

On Thu, Dec 11, 2014 at 5:47 PM, Yi Tian tianyi.asiai...@gmail.com wrote:

 Hi, all

  We found some bugs in hive-0.12, but we could not wait for the Hive community
  to fix them.

  We want to fix these bugs in our lab and build a new release which could
  be recognized by Spark.

 As we know, spark depends on a special release of hive, like:

  |<dependency>
    <groupId>org.spark-project.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>${hive.version}</version>
  </dependency>
  |

  The difference between |org.spark-project.hive| and |org.apache.hive| was
  described by Patrick:

 |There are two differences:

 1. We publish hive with a shaded protobuf dependency to avoid
 conflicts with some Hadoop versions.
 2. We publish a proper hive-exec jar that only includes hive packages.
 The upstream version of hive-exec bundles a bunch of other random
 dependencies in it which makes it really hard for third-party projects
 to use it.
 |

 Is there any document to guide us how to build the hive jars for spark?

 Any help would be greatly appreciated.

 ​



Re: Data source interface for making multiple tables available for query

2014-12-22 Thread Michael Armbrust
I agree and this is something that we have discussed in the past.
Essentially I think instead of creating a RelationProvider that returns a
single table, we'll have something like an external catalog that can return
multiple base relations.

On Sun, Dec 21, 2014 at 6:43 PM, Venkata ramana gollamudi 
ramana.gollam...@huawei.com wrote:

 Hi,

  In the data source ddl.scala, CREATE TEMPORARY TABLE makes one table at a time
  available as a temp table. How about the case where multiple/all tables from
  some data source need to be available for query, just like Hive tables? I
  think we also need an interface to connect such data sources. Please
  comment.

 Regards,
 Ramana



Re: Unsupported Catalyst types in Parquet

2014-12-29 Thread Michael Armbrust
I'd love to get both of these in.  There is some trickiness that I talk
about on the JIRA for timestamps, since the SQL timestamp class can support
nanoseconds and I don't think Parquet has a type for this.  Other systems
(Impala) seem to use INT96.  It would be good to ask on the Parquet mailing
list what the plan is there, to make sure that whatever we do is going to be
compatible long term.

Michael

On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Daoyuan,

 Thanks for creating the jiras. I need these features by... last week, so
 I'd be happy to take care of this myself, if only you or someone more
 experienced than me in the SparkSQL codebase could provide some guidance.

 Alex
 On Dec 29, 2014 12:06 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote:

 Hi Alex,

  I'll create JIRA SPARK-4985 for date type support in Parquet, and
  SPARK-4987 for timestamp type support. For decimal type, I think we only
  support decimals that fit in a long.

 Thanks,
 Daoyuan

 -Original Message-
 From: Alessandro Baretta [mailto:alexbare...@gmail.com]
 Sent: Saturday, December 27, 2014 2:47 PM
 To: dev@spark.apache.org; Michael Armbrust
 Subject: Unsupported Catalyst types in Parquet

 Michael,

  I'm having trouble storing my SchemaRDDs in Parquet format with SparkSQL,
  due to my RDDs having DateType and DecimalType fields. What would it
  take to add Parquet support for these Catalyst types? Are there any other
  Catalyst types for which there is no Parquet support?

 Alex




Re: query planner design doc?

2015-01-23 Thread Michael Armbrust
No, are you looking for something in particular?

On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy halcyo...@gmail.com
wrote:

 Okay, thanks.  The design document mostly details the infrastructure for
 optimization strategies but doesn’t detail the strategies themselves.  I
 take it the set of strategies are basically embodied in
 SparkStrategies.scala...is there a design doc/roadmap/JIRA issue detailing
 what strategies exist and which are planned?

 Thanks,
 Nick

 On Jan 22, 2015, at 7:45 PM, Michael Armbrust mich...@databricks.com
 wrote:

  Here is the initial design document for Catalyst:

 https://docs.google.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit

  Strategies (many of which are in SparkStrategies.scala) are the part that
 creates the physical operators from a catalyst logical plan.  These
 operators have execute() methods that actually call RDD operations.

 On Thu, Jan 22, 2015 at 3:19 PM, Nicholas Murphy halcyo...@gmail.com
 wrote:

 Hi-

 Quick question: is there a design doc (or something more than “look at
 the code”) for the query planner for Spark SQL (i.e., the component that
 takes…Catalyst?…operator trees and translates them into SPARK operations)?

 Thanks,
 Nick
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: Are there any plans to run Spark on top of Succinct

2015-01-26 Thread Michael Armbrust
There was work being done at Berkeley on prototyping support for Succinct
in Spark SQL.  Rachit might have more information.

On Thu, Jan 22, 2015 at 7:04 AM, Dean Wampler deanwamp...@gmail.com wrote:

 Interesting. I was wondering recently if anyone has explored working with
 compressed data directly.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Jan 22, 2015 at 2:59 AM, Mick Davies michael.belldav...@gmail.com
 
 wrote:

 
  http://succinct.cs.berkeley.edu/wp/wordpress/
 
  Looks like a really interesting piece of work that could dovetail well
 with
  Spark.
 
  I have been trying recently to optimize some queries I have running on
  Spark
  on top of Parquet but the support from Parquet for predicate push down
  especially for dictionary based columns is a bit limiting. I am not sure,
  but from a cursory view it looks like this format may help in this area.
 
  Mick
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Are-there-any-plans-to-run-Spark-on-top-of-Succinct-tp10243.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Michael Armbrust
+1 to adding such an optimization to Parquet.  The bytes are tagged
specially as UTF8 in the Parquet schema, so it seems like it would be
possible to add this.
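
For reference, a minimal sketch of the per-column caching Mick describes
below. This is a hypothetical helper, not the actual ParquetConverter change,
and it assumes the pre-Apache parquet.io.api.Binary class with its
toStringUsingUTF8 method:

import parquet.io.api.Binary

// One instance per string column: remember the last Binary seen and reuse the
// decoded String when the same instance (e.g. a dictionary entry) repeats.
final class CachedStringConverter {
  private[this] var lastBinary: Binary = _
  private[this] var lastString: String = _

  def convert(value: Binary): String = {
    if (value ne lastBinary) {  // reference check is enough for dictionary-backed values
      lastBinary = value
      lastString = value.toStringUsingUTF8
    }
    lastString
  }
}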

On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies michael.belldav...@gmail.com
wrote:

 Hi,

 It seems that a reasonably large proportion of query time using Spark SQL
 is spent decoding Parquet Binary objects to produce Java Strings.
 Has anyone considered trying to optimize these conversions, as many are
 duplicated?

 Details are outlined in the conversation in the user mailing list

 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html
 
 , I have copied a bit of that discussion here.

 It seems that as Spark processes each row from Parquet it makes a call to
 convert the Binary representation for each String column into a Java
 String.
 However in many (probably most) circumstances the underlying Binary
 instance
 from Parquet will have come from a Dictionary, for example when column
 cardinality is low. Therefore Spark is converting the same byte array to a
 copy of the same Java String over and over again. This is bad due to extra
 cpu, extra memory used for these strings, and probably results in more
 expensive grouping comparisons.


 I tested a simple hack to cache the last Binary-String conversion per
 column in ParquetConverter and this led to a 25% performance improvement
 for
 the queries I used. Admittedly this was over a data set with lots or runs
 of
 the same Strings in the queried columns.

 These costs are quite significant for the type of data that I expect will
 be
 stored in Parquet which will often have denormalized tables and probably
 lots of fairly low cardinality string columns

 I think a good way to optimize this would be if changes could be made to
 Parquet so that  the encoding/decoding of Objects to Binary is handled on
 Parquet side of fence. Parquet could deal with Objects (Strings) as the
 client understands them and only use encoding/decoding to store/read from
 underlying storage medium. Doing this I think Parquet could ensure that the
 encoding/decoding of each Object occurs only once.

 Does anyone have an opinion on this, has it been considered already?

 Cheers Mick







 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: query planner design doc?

2015-01-22 Thread Michael Armbrust
Here is the initial design document for Catalyst:
https://docs.google.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit

Strategies (many of which are in SparkStrategies.scala) are the part that
creates the physical operators from a Catalyst logical plan.  These
operators have execute() methods that actually call RDD operations.

On Thu, Jan 22, 2015 at 3:19 PM, Nicholas Murphy halcyo...@gmail.com
wrote:

 Hi-

 Quick question: is there a design doc (or something more than “look at the
 code”) for the query planner for Spark SQL (i.e., the component that
 takes…Catalyst?…operator trees and translates them into SPARK operations)?

 Thanks,
 Nick
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: HiveContext cannot be serialized

2015-02-16 Thread Michael Armbrust
I'd suggest marking the HiveContext as @transient since it's not valid to
use it on the slaves anyway.
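
A hypothetical sketch of that workaround (class and field names are made up;
the idea is just that the context lives in a @transient field and is only
touched on the driver):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

class StreamingQueries(@transient val sc: SparkContext) extends Serializable {

  @transient private var hive: HiveContext = _

  // Only ever called on the driver; the HiveContext is never shipped with closures.
  def hiveContext: HiveContext = {
    if (hive == null) hive = new HiveContext(sc)
    hive
  }
}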

On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang hw...@qilinsoft.com wrote:

 While investigating this issue (described at the end of this email), I took a
 look at HiveContext's code and found this change
 (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
 da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):



 -  @transient protected[hive] lazy val hiveconf = new
 HiveConf(classOf[SessionState])

 -  @transient protected[hive] lazy val sessionState = {

 -val ss = new SessionState(hiveconf)

 -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
 initial set of HiveConf.

 -ss

 -  }

 +  @transient protected[hive] lazy val (hiveconf, sessionState) =

 +Option(SessionState.get())

 +  .orElse {



 With the new change, Scala compiler always generate a Tuple2 field of
 HiveContext as below:



 private Tuple2 x$3;

 private transient OutputStream outputBuffer;

 private transient HiveConf hiveconf;

 private transient SessionState sessionState;

 private transient HiveMetastoreCatalog catalog;



 That x$3 field's key is a HiveConf object, which cannot be serialized. So
 can you suggest how to resolve this issue? Thank you very much!



 



 I have a streaming application which registered temp table on a
 HiveContext for each batch duration.

 The application runs well in Spark 1.1.0. But I get below error from
 1.1.1.

 Do you have any suggestions to resolve it? Thank you!



 java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf

 - field (class scala.Tuple2, name: _1, type: class
 java.lang.Object)

 - object (class scala.Tuple2, (Configuration: core-default.xml,
 core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
 yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
 org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apa
 che.hadoop.hive.ql.session.SessionState@49b6eef9))

 - field (class org.apache.spark.sql.hive.HiveContext, name: x$3,
 type: class scala.Tuple2)

 - object (class org.apache.spark.sql.hive.HiveContext,
 org.apache.spark.sql.hive.HiveContext@4e6e66a4)

 - field (class
 example.BaseQueryableDStream$$anonfun$registerTempTable$2, name:
 sqlContext$1, type: class org.apache.spark.sql.SQLContext)

- object (class
 example.BaseQueryableDStream$$anonfun$registerTempTable$2,
 function1)

 - field (class
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1,
 name: foreachFunc$1, type: interface scala.Function1)

 - object (class
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1,
 function2)

 - field (class org.apache.spark.streaming.dstream.ForEachDStream,
 name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc,
 type: interface scala.Function2)

 - object (class org.apache.spark.streaming.dstream.ForEachDStream,
 org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)

 - element of array (index: 0)

 - array (class [Ljava.lang.Object;, size: 16)

 - field (class scala.collection.mutable.ArrayBuffer, name:
 array, type: class [Ljava.lang.Object;)

 - object (class scala.collection.mutable.ArrayBuffer,
 ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))

 - field (class org.apache.spark.streaming.DStreamGraph, name:
 outputStreams, type: class scala.collection.mutable.ArrayBuffer)

 - custom writeObject data (class
 org.apache.spark.streaming.DStreamGraph)

 - object (class org.apache.spark.streaming.DStreamGraph,
 org.apache.spark.streaming.DStreamGraph@776ae7da)

 - field (class org.apache.spark.streaming.Checkpoint, name:
 graph, type: class org.apache.spark.streaming.DStreamGraph)

 - root object (class org.apache.spark.streaming.Checkpoint,
 org.apache.spark.streaming.Checkpoint@5eade065)

 at java.io.ObjectOutputStream.writeObject0(Unknown Source)










Re: HiveContext cannot be serialized

2015-02-16 Thread Michael Armbrust
I was suggesting you mark the variable that is holding the HiveContext
'@transient' since the Scala compiler is not correctly propagating this
through the tuple extraction.  This is only a workaround.  We can also
remove the tuple extraction.

On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote:

 Michael - it is already transient. This should probably be considered a bug
 in the Scala compiler, but we can easily work around it by removing the use
 of destructuring binding.

 On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust mich...@databricks.com
  wrote:

 I'd suggest marking the HiveContext as @transient since its not valid to
 use it on the slaves anyway.

 On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang hw...@qilinsoft.com wrote:

  When I'm investigating this issue (in the end of this email), I take a
  look at HiveContext's code and find this change
  (
 https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
  da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
 
 
 
  -  @transient protected[hive] lazy val hiveconf = new
  HiveConf(classOf[SessionState])
 
  -  @transient protected[hive] lazy val sessionState = {
 
  -val ss = new SessionState(hiveconf)
 
  -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
  initial set of HiveConf.
 
  -ss
 
  -  }
 
  +  @transient protected[hive] lazy val (hiveconf, sessionState) =
 
  +Option(SessionState.get())
 
  +  .orElse {
 
 
 
  With the new change, Scala compiler always generate a Tuple2 field of
  HiveContext as below:
 
 
 
  private Tuple2 x$3;
 
  private transient OutputStream outputBuffer;
 
  private transient HiveConf hiveconf;
 
  private transient SessionState sessionState;
 
  private transient HiveMetastoreCatalog catalog;
 
 
 
  That x$3 field's key is HiveConf object that cannot be serialized. So
  can you suggest how to resolve this issue? Thank you very much!
 
 
 
  
 
 
 
  I have a streaming application which registered temp table on a
  HiveContext for each batch duration.
 
  The application runs well in Spark 1.1.0. But I get below error from
  1.1.1.
 
  Do you have any suggestions to resolve it? Thank you!
 
 
 
  java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
 
  - field (class scala.Tuple2, name: _1, type: class
  java.lang.Object)
 
  - object (class scala.Tuple2, (Configuration: core-default.xml,
  core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
  yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
  org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23
 ,org.apa
  che.hadoop.hive.ql.session.SessionState@49b6eef9))
 
  - field (class org.apache.spark.sql.hive.HiveContext, name: x$3,
  type: class scala.Tuple2)
 
  - object (class org.apache.spark.sql.hive.HiveContext,
  org.apache.spark.sql.hive.HiveContext@4e6e66a4)
 
  - field (class
  example.BaseQueryableDStream$$anonfun$registerTempTable$2, name:
  sqlContext$1, type: class org.apache.spark.sql.SQLContext)
 
 - object (class
  example.BaseQueryableDStream$$anonfun$registerTempTable$2,
  function1)
 
  - field (class
  org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1,
  name: foreachFunc$1, type: interface scala.Function1)
 
  - object (class
  org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1,
  function2)
 
  - field (class org.apache.spark.streaming.dstream.ForEachDStream,
  name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc,
  type: interface scala.Function2)
 
  - object (class org.apache.spark.streaming.dstream.ForEachDStream,
  org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
 
  - element of array (index: 0)
 
  - array (class [Ljava.lang.Object;, size: 16)
 
  - field (class scala.collection.mutable.ArrayBuffer, name:
  array, type: class [Ljava.lang.Object;)
 
  - object (class scala.collection.mutable.ArrayBuffer,
  ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20
 ))
 
  - field (class org.apache.spark.streaming.DStreamGraph, name:
  outputStreams, type: class scala.collection.mutable.ArrayBuffer)
 
  - custom writeObject data (class
  org.apache.spark.streaming.DStreamGraph)
 
  - object (class org.apache.spark.streaming.DStreamGraph,
  org.apache.spark.streaming.DStreamGraph@776ae7da)
 
  - field (class org.apache.spark.streaming.Checkpoint, name:
  graph, type: class org.apache.spark.streaming.DStreamGraph)
 
  - root object (class org.apache.spark.streaming.Checkpoint,
  org.apache.spark.streaming.Checkpoint@5eade065)
 
  at java.io.ObjectOutputStream.writeObject0(Unknown Source)
 
 
 
 
 
 
 
 





Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Michael Armbrust

 P.S: For some reason replacing "import sqlContext.createSchemaRDD" with
 "import sqlContext.implicits._" doesn't do the implicit conversions.
 registerTempTable gives a syntax error. I will dig deeper tomorrow. Has
 anyone seen this?


We will write up a whole migration guide before the final release, but I
can quickly explain this one.  We made the implicit conversion
significantly less broad to avoid the chance of confusing conflicts.
However, now you have to call .toDF in order to force RDDs to become
DataFrames.
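
A short sketch of the new pattern (the case class, data and table name are
made up; this assumes the 1.3 implicits):

import sqlContext.implicits._

case class Person(name: String, age: Int)

val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))).toDF()
people.registerTempTable("people")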


Re: Hive SKEWED feature supported in Spark SQL ?

2015-02-19 Thread Michael Armbrust

 1) is SKEWED BY honored ? If so, has anyone run into directories not being
 created ?


It is not.

2) if it is not honored, does it matter ? Hive introduced this feature to
 better handle joins where tables had a skewed distribution on keys joined
 on so that the single mapper handling one of the keys didn't hold up the
 whole process. Could that happen in Spark / Spark SQL?


It could matter for very skewed data, though I have not heard many
complaints.  We could consider adding it in the future if people are having
problems with skewed data.
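
If skew does become a problem, one common manual workaround (a generic
technique, not something Spark SQL does for you; names and the salt factor
below are illustrative) is to salt the hot keys on one side of the join and
replicate the other side per salt value:

import org.apache.spark.SparkContext._  // pair-RDD implicits on older releases
import scala.util.Random

val left  = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))  // skewed side
val right = sc.parallelize(Seq(("hot", "x"), ("cold", "y")))          // small side

val saltFactor  = 8
val saltedLeft  = left.map { case (k, v) => ((k, Random.nextInt(saltFactor)), v) }
val saltedRight = right.flatMap { case (k, w) => (0 until saltFactor).map(i => ((k, i), w)) }

// The hot key is now spread across saltFactor reduce tasks; strip the salt afterwards.
val joined = saltedLeft.join(saltedRight).map { case ((k, _), (v, w)) => (k, (v, w)) }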


Re: renaming SchemaRDD - DataFrame

2015-01-28 Thread Michael Armbrust
In particular the performance tricks are in SpecificMutableRow.

On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan velvia.git...@gmail.com wrote:

 Yeah, it's null.   I was worried you couldn't represent it in Row
 because of primitive types like Int (unless you box the Int, which
 would be a performance hit).  Anyways, I'll take another look at the
 Row API again  :-p

 On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote:
  Isn't that just null in SQL?
 
  On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com
 wrote:
 
  I believe that most DataFrame implementations out there, like Pandas,
  supports the idea of missing values / NA, and some support the idea of
  Not Meaningful as well.
 
  Does Row support anything like that?  That is important for certain
  applications.  I thought that Row worked by being a mutable object,
  but haven't looked into the details in a while.
 
  -Evan
 
  On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin r...@databricks.com
 wrote:
   It shouldn't change the data source api at all because data sources
   create
   RDD[Row], and that gets converted into a DataFrame automatically
   (previously
   to SchemaRDD).
  
  
  
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
  
   One thing that will break the data source API in 1.3 is the location
 of
   types. Types were previously defined in sql.catalyst.types, and now
   moved to
   sql.types. After 1.3, sql.catalyst is hidden from users, and all
 public
   APIs
   have first class classes/objects defined in sql directly.
  
  
  
   On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan velvia.git...@gmail.com
   wrote:
  
   Hey guys,
  
   How does this impact the data sources API?  I was planning on using
   this for a project.
  
   +1 that many things from spark-sql / DataFrame is universally
   desirable and useful.
  
   By the way, one thing that prevents the columnar compression stuff in
   Spark SQL from being more useful is, at least from previous talks
 with
   Reynold and Michael et al., that the format was not designed for
   persistence.
  
   I have a new project that aims to change that.  It is a
   zero-serialisation, high performance binary vector library, designed
   from the outset to be a persistent storage friendly.  May be one day
   it can replace the Spark SQL columnar compression.
  
   Michael told me this would be a lot of work, and recreates parts of
   Parquet, but I think it's worth it.  LMK if you'd like more details.
  
   -Evan
  
   On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin r...@databricks.com
   wrote:
Alright I have merged the patch (
https://github.com/apache/spark/pull/4173
) since I don't see any strong opinions against it (as a matter of
fact
most were for it). We can still change it if somebody lays out a
strong
argument.
   
On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
matei.zaha...@gmail.com
wrote:
   
The type alias means your methods can specify either type and they
will
work. It's just another name for the same type. But Scaladocs and
such
will
show DataFrame as the type.
   
Matei
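
 For illustration, a tiny self-contained toy of why the alias keeps old signatures compiling (the DataFrame below is a stand-in class for the sketch, not Spark's):

 object AliasDemo {
   class DataFrame                      // stand-in type for illustration only
   type SchemaRDD = DataFrame           // the alias: both names denote the same type

   def oldApi(rdd: SchemaRDD): String = "still compiles"

   def main(args: Array[String]): Unit = {
     println(oldApi(new DataFrame))     // a DataFrame satisfies a SchemaRDD parameter
   }
 }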
   
 On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho 
dirceu.semigh...@gmail.com wrote:

 Reynold,
 But with type alias we will have the same problem, right?
  If the methods don't receive SchemaRDD anymore, we will have to
  change our code to migrate from SchemaRDD to DataFrame. Unless we have an
  implicit conversion between DataFrame and SchemaRDD



 2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:

 Dirceu,

 That is not possible because one cannot overload return types.

 SQLContext.parquetFile (and many other methods) needs to return
 some
type,
 and that type cannot be both SchemaRDD and DataFrame.

 In 1.3, we will create a type alias for DataFrame called
 SchemaRDD
 to
not
 break source compatibility for Scala.


 On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:

 Can't the SchemaRDD remain the same, but deprecated, and be
 removed
 in
the
 release 1.5(+/- 1)  for example, and the new code been added
 to
DataFrame?
 With this, we don't impact in existing code for the next few
 releases.



 2015-01-27 0:02 GMT-02:00 Kushal Datta 
 kushal.da...@gmail.com:

 I want to address the issue that Matei raised about the heavy
 lifting
 required for a full SQL support. It is amazing that even
 after
 30
years
 of
 research there is not a single good open source columnar
 database
 like
 Vertica. There is a column store option in MySQL, but it is
 not
 nearly
 as
 sophisticated as Vertica or MonetDB. But there's a true need
 for
 such
a
 system. 

Re: Caching tables at column level

2015-02-01 Thread Michael Armbrust
Its not completely transparent, but you can do something like the following
today:

CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable
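
A roughly equivalent sketch through the DataFrame API (this assumes a 1.3+ sqlContext; the table and column names here are hypothetical):

val hot = sqlContext.table("fullTable").select("colA", "colB", "colC")
hot.registerTempTable("hotData")
sqlContext.cacheTable("hotData")                       // the cached temp table holds only the projected columns
sqlContext.sql("SELECT colA FROM hotData WHERE colB > 0").show()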

On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies michael.belldav...@gmail.com
wrote:

 I have been working a lot recently with denormalised tables with lots of
 columns, nearly 600. We are using this form to avoid joins.

 I have tried to use cache table with this data, but it proves too expensive
 as it seems to try to cache all the data in the table.

 For data sets such as the one I am using you find that certain columns will
 be hot, referenced frequently in queries, others will be used very
 infrequently.

 Therefore it would be great if caches could be column based. I realise that
 this may not be optimal for all use cases, but I think it could be quite a
 common need.  Has something like this been considered?

 Thanks Mick



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Caching-tables-at-column-level-tp10377.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




GitHub Syncing Down

2015-03-10 Thread Michael Armbrust
FYI: https://issues.apache.org/jira/browse/INFRA-9259


Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Michael Armbrust
Thanks for reporting.  This was a result of a change to our DDL parser that
resulted in types becoming reserved words.  I've filed a JIRA and will
investigate if this is something we can fix.
https://issues.apache.org/jira/browse/SPARK-6250

On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote:

 In Spark 1.2 I used to be able to do this:

 scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
 res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true)))

 That is, the name of a column can be a keyword like int. This is no
 longer the case in 1.3:

 data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
 org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
 failure: ``'' expected but `int' found

 struct<int:bigint>
        ^
 at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
 at
 org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
 at
 org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)

 Note HiveTypeHelper is simply an object I load in to expose
 HiveMetastoreTypes since it was made private. See
 https://gist.github.com/nitay/460b41ed5fd7608507f5

 This is actually a pretty big problem for us as we have a bunch of legacy
 tables with column names like timestamp. They work fine in 1.2, but now
 everything throws in 1.3.

 Any thoughts?

 Thanks,
 - Nitay
 Founder  CTO




Re: Any guidance on when to back port and how far?

2015-03-24 Thread Michael Armbrust
Two other criteria that I use when deciding what to backport:
 - Is it a regression from a previous minor release?  I'm much more likely
to backport fixes in this case, as I'd love for most people to stay up to
date.
 - How scary is the change?  I think the primary goal is stability of the
maintenance branches.  When I am confident that something is isolated and
unlikely to break things (i.e. I'm fixing a confusing error message), then
i'm much more likely to backport it.

Regarding the length of time to continue backporting, I mostly don't
backport to N-1, but this is partially because SQL is changing too fast for
that to generally be useful.  These old branches usually only get attention
from me when there is an explicit request.

I'd love to hear more feedback from others.

Michael

On Tue, Mar 24, 2015 at 6:13 AM, Sean Owen so...@cloudera.com wrote:

 So far, my rule of thumb has been:

 - Don't back-port new features or improvements in general, only bug fixes
 - Don't back-port minor bug fixes
 - Back-port bug fixes that seem important enough to not wait for the
 next minor release
 - Back-port site doc changes to the release most likely to go out
 next, to make it a part of the next site publish

 But, how far should back-ports go, in general? If the last minor
 release was 1.N, then to branch 1.N surely. Farther back is a question
 of expectation for support of past minor releases. Given the pace of
 change and time available, I assume there's not much support for
 continuing to use release 1.(N-1) and very little for 1.(N-2).

 Concretely: does anyone expect a 1.1.2 release ever? a 1.2.2 release?
 It'd be good to hear the received wisdom explicitly.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: enum-like types in Spark

2015-03-04 Thread Michael Armbrust
#4 with a preference for CamelCaseEnums
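
For concreteness, a minimal sketch of one way to spell option #4 with CamelCase values (an invented example, not an actual Spark API):

sealed trait CacheMode
object CacheMode {
  case object MemoryOnly extends CacheMode   // CamelCase value names, per the preference above
  case object DiskOnly extends CacheMode
  case object Off extends CacheMode
  val values: Seq[CacheMode] = Seq(MemoryOnly, DiskOnly, Off)   // handy for iteration and docs
}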

On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com
wrote:

 another vote for #4
 People are already used to adding () in Java.


 On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote:

  #4 but with MemoryOnly (more scala-like)
 
  http://docs.scala-lang.org/style/naming-conventions.html
 
  Constants, Values, Variable and Methods
 
  Constant names should be in upper camel case. That is, if the member is
  final, immutable and it belongs to a package object or an object, it may
 be
  considered a constant (similar to Java’s static final members):
 
 
     object Container {
       val MyConstant = ...
     }
 
 
  2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:
 
   Hi all,
  
   There are many places where we use enum-like types in Spark, but in
   different ways. Every approach has both pros and cons. I wonder
   whether there should be an “official” approach for enum-like types in
   Spark.
  
   1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)
  
   * All types show up as Enumeration.Value in Java.
  
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
  
   2. Java’s Enum (e.g., SaveMode, IOMode)
  
   * Implementation must be in a Java file.
   * Values doesn’t show up in the ScalaDoc:
  
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
  
   3. Static fields in Java (e.g., TripletFields)
  
   * Implementation must be in a Java file.
   * Doesn’t need “()” in Java code.
   * Values don't show up in the ScalaDoc:
  
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
  
   4. Objects in Scala. (e.g., StorageLevel)
  
   * Needs “()” in Java code.
   * Values show up in both ScalaDoc and JavaDoc:
  
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
  
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
  
   It would be great if we have an “official” approach for this as well
   as the naming convention for enum-like values (“MEMORY_ONLY” or
   “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?
  
   Best,
   Xiangrui
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Michael Armbrust
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
 every combination of Hadoop and Hive versions, etc., can be supported, but
 even an example build from the Building Spark page isn't looking too good
 to me.


I would definitely expect this to build and we do actually test that for
each PR.  We don't yet run the tests for both versions of Hive and thus
unfortunately these do get out of sync.  Usually these are just problems
diff-ing golden output or cases where we have added a test that uses a
feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?


Re: [SQL][Feature] Access row by column name instead of index

2015-04-24 Thread Michael Armbrust
Already done :)

https://github.com/apache/spark/commit/2e8c6ca47df14681c1110f0736234ce76a3eca9b
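
For reference, name-based access from Scala looks roughly like this (a sketch only; it assumes a sqlContext with a registered people table, and the column names are made up):

val df = sqlContext.sql("SELECT name, age FROM people")
df.collect().foreach { row =>
  val name = row.getAs[String]("name")   // lookup by column name
  val age  = row.getAs[Int]("age")       // instead of positional row.getInt(1)
  println(s"$name: $age")
}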

On Fri, Apr 24, 2015 at 2:37 PM, Reynold Xin r...@databricks.com wrote:

 Can you elaborate what you mean by that? (what's already available in
 Python?)


 On Fri, Apr 24, 2015 at 2:24 PM, Shuai Zheng szheng.c...@gmail.com
 wrote:

  Hi All,
 
  I want to ask whether there is a plan to implement the feature to access
  the Row in sql by name? Currently we can only allow to access a row by
  index (there is a python version api of access by name, but none for
 java)
 
  Regards,
 
  Shuai
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not threadsafe.  I would recommend
using HiveQL.
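
A hedged sketch of that suggestion (requires spark-hive on the classpath; sc is an existing SparkContext and the query is the one from the report below):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)   // parses with HiveQL by default

// Issue the queries concurrently through the HiveQL path instead of the simple SQL parser.
val results = (1 to 4).par.map { _ =>
  hiveContext.sql("SELECT id, ext.d FROM UNIT_TEST").collect()
}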

On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote:

 actually this is a sql parse exception, are you sure your sql is right?

 Sent from my iPhone

  On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote:
 
  Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a
  same SQLContext instance, but below exception is thrown, so it looks
  like SQLContext is NOT thread safe? I think this is not the desired
  behavior.
 
  ==
 
  java.lang.RuntimeException: [1.1] failure: ``insert'' expected but
  identifier select found
 
  select id ,ext.d from UNIT_TEST
  ^
  at scala.sys.package$.error(package.scala:27)
  at
  org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
  SQLParser.scala:40)
  at
  org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
  at
  org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
  at
  org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
  QLParser$$others$1.apply(SparkSQLParser.scala:96)
  at
  org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
  QLParser$$others$1.apply(SparkSQLParser.scala:95)
  at
  scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
  at
  scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
  s.scala:242)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
  s.scala:242)
  at
  scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
  apply$2.apply(Parsers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
  apply$2.apply(Parsers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
  sers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
  sers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
  scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
  rsers.scala:891)
  at
  scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
  rsers.scala:891)
  at
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
  at
  scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
  at
  scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParser
  s.scala:110)
  at
  org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
  SQLParser.scala:38)
  at
  org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
  la:134)
  at
  org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
  la:134)
  at scala.Option.getOrElse(Option.scala:120)
  at
  org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)
 
  -Original Message-
  From: Cheng, Hao [mailto:hao.ch...@intel.com]
  Sent: Monday, March 02, 2015 9:05 PM
  To: Haopu Wang; user
  Subject: RE: Is SQLContext thread-safe?
 
  Yes it is thread safe, at least it's supposed to be.
 
  -Original Message-
  From: Haopu Wang [mailto:hw...@qilinsoft.com]
  Sent: Monday, March 2, 2015 4:43 PM
  To: user
  Subject: Is SQLContext thread-safe?
 
  Hi, is it safe to use the same SQLContext to do Select operations in
  different threads at the same time? Thank you very much!
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
  commands, e-mail: user-h...@spark.apache.org
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 



Re: Uninitialized session in HiveContext?

2015-04-30 Thread Michael Armbrust
Hey Marcelo,

Thanks for the heads up!  I'm currently in the process of refactoring all
of this (to separate the metadata connection from the execution side) and
as part of this I'm making the initialization of the session not lazy.  It
would be great to hear if this also works for your internal integration
tests once the patch is up (hopefully this weekend).

Michael

On Thu, Apr 30, 2015 at 2:36 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Hey all,

 We ran into some test failures in our internal branch (which builds
 against Hive 1.1), and I narrowed it down to the fix below. I'm not
 super familiar with the Hive integration code, but does this look like
 a bug for other versions of Hive too?

 This caused an error where some internal Hive configuration that is
 initialized by the session were not available.

 diff --git
 a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 index dd06b26..6242745 100644
 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 @@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
      if (conf.dialect == "sql") {
        super.sql(substituted)
      } else if (conf.dialect == "hiveql") {
 +      // Make sure Hive session state is initialized.
 +      if (SessionState.get() != sessionState) {
 +        SessionState.start(sessionState)
 +      }
        val ddlPlan = ddlParserWithHiveQL.parse(sqlText, exceptionOnError = false)
        DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
      } else {



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Plans for upgrading Hive dependency?

2015-04-29 Thread Michael Armbrust
I am working on it.  Here is the (very rough) version:
https://github.com/apache/spark/compare/apache:master...marmbrus:multiHiveVersions

On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal punya.bis...@gmail.com
wrote:

 Thanks Marcelo and Patrick - I don't know how I missed that ticket in my
 Jira search earlier. Is anybody working on the sub-issues yet, or is there
 a design doc I should look at before taking a stab?

 Regards,
 Punya

 On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Punya,
 
  There is some ongoing work to help make Hive upgrades more manageable
  and allow us to support multiple versions of Hive. Once we do that, it
  will be much easier for us to upgrade.
 
  https://issues.apache.org/jira/browse/SPARK-6906
 
  - Patrick
 
  On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin van...@cloudera.com
  wrote:
   That's a lot more complicated than you might think.
  
   We've done some basic work to get HiveContext to compile against Hive
   1.1.0. Here's the code:
  
 
 https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec
  
   We didn't sent that upstream because that only solves half of the
   problem; the hive-thriftserver is disabled in our CDH build because it
   uses a lot of Hive APIs that have been removed in 1.1.0, so even
   getting it to compile is really complicated.
  
   If there's interest in getting the HiveContext part fixed up I can
   send a PR for that code. But at this time I don't really have plans to
   look at the thrift server.
  
  
   On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
   punya.bis...@gmail.com wrote:
   Dear Spark devs,
  
   Is there a plan for staying up-to-date with current (and future)
  versions
   of Hive? Spark currently supports version 0.13 (June 2014), but the
  latest
   version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets
  about
   updating beyond 0.13, so I was wondering if this was intentional or it
  was
   just that nobody had started work on this yet.
  
   I'd be happy to work on a PR for the upgrade if one of the core
  developers
   can tell me what pitfalls to watch out for.
  
   Punya
  
  
  
   --
   Marcelo
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
 



Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Michael Armbrust
I'd happily merge a PR that changes the distinct implementation to be more
like Spark core, assuming it includes benchmarks that show better
performance for both the "fits in memory" case and the "too big for memory"
case.

On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Ok, but for the moment, this seems to be killing performances on some
 computations...
 I'll try to give you precise figures on this between rdd and dataframe.

 Olivier.

 Le jeu. 7 mai 2015 à 10:08, Reynold Xin r...@databricks.com a écrit :

  In 1.5, we will most likely just rewrite distinct in SQL to either use
 the
  Aggregate operator which will benefit from all the Tungsten
 optimizations,
  or have a Tungsten version of distinct for SQL/DataFrame.
 
  On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot 
  o.girar...@lateral-thoughts.com wrote:
 
  Hi everyone,
  there seems to be different implementations of the distinct feature in
  DataFrames and RDD and some performance issue with the DataFrame
 distinct
  API.
 
  In RDD.scala :
 
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }

  And in DataFrame :

  case class Distinct(partial: Boolean, child: SparkPlan) extends UnaryNode {
    override def output: Seq[Attribute] = child.output
    override def requiredChildDistribution: Seq[Distribution] =
      if (partial) UnspecifiedDistribution :: Nil else ClusteredDistribution(child.output) :: Nil

    override def execute(): RDD[Row] = {
      child.execute().mapPartitions { iter =>
        val hashSet = new scala.collection.mutable.HashSet[Row]()
        var currentRow: Row = null
        while (iter.hasNext) {
          currentRow = iter.next()
          if (!hashSet.contains(currentRow)) {
            hashSet.add(currentRow.copy())
          }
        }
        hashSet.iterator
      }
    }
  }
 
 
 
 
 
 
  I can try to reproduce more clearly the performance issue, but do you
 have
  any insights into why we can't have the same distinct strategy between
  DataFrame and RDD ?
 
  Regards,
 
  Olivier.
 
 



Re: Speeding up Spark build during development

2015-05-04 Thread Michael Armbrust
FWIW... My Spark SQL development workflow is usually to run build/sbt
sparkShell or build/sbt 'sql/test-only testSuiteName'.  These commands
start in as little as 30s on my laptop, automatically figure out which
subprojects need to be rebuilt, and don't require the expensive assembly
creation.

On Mon, May 4, 2015 at 5:48 AM, Meethu Mathew meethu.mat...@flytxt.com
wrote:

 Hi,

  Is it really necessary to run mvn --projects assembly/ -DskipTests
 install? Could you please explain why this is needed?
 I got the changes after running mvn --projects streaming/ -DskipTests
 package.

 Regards,
 Meethu


 On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote:

 Just to give you an example:

 When I was trying to make a small change only to the Streaming component
 of
 Spark, first I built and installed the whole Spark project (this took
 about
 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed
 files
 only in Streaming, I ran something like (in the top-level directory):

 mvn --projects streaming/ -DskipTests package

 and then

 mvn --projects assembly/ -DskipTests install


 This was much faster than trying to build the whole Spark from scratch,
 because Maven was only building one component, in my case the Streaming
 component, of Spark. I think you can use a very similar approach.

 --
 Emre Sevinç



 On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri 
 pramodbilig...@gmail.com
 wrote:

  No, I just need to build one project at a time. Right now SparkSql.

 Pramod

 On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com
 wrote:

  Hello Pramod,

 Do you need to build the whole project every time? Generally you don't,
 e.g., when I was changing some files that belong only to Spark
 Streaming, I
 was building only the streaming (of course after having build and
 installed
 the whole project, but that was done only once), and then the assembly.
 This was much faster than trying to build the whole Spark every time.

 --
 Emre Sevinç

 On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri 
 pramodbilig...@gmail.com

 wrote:
 Using the inbuilt maven and zinc it takes around 10 minutes for each
 build.
 Is that reasonable?
 My maven opts looks like this:
 $ echo $MAVEN_OPTS
 -Xmx12000m -XX:MaxPermSize=2048m

 I'm running it as build/mvn -DskipTests package

 Should I be tweaking my Zinc/Nailgun config?

 Pramod

 On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com
 wrote:



 https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn

 On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri 

 pramodbilig...@gmail.com

 wrote:

  This is great. I didn't know about the mvn script in the build

 directory.

 Pramod

 On Fri, May 1, 2015 at 9:51 AM, York, Brennon 
 brennon.y...@capitalone.com
 wrote:

  Following what Ted said, if you leverage the `mvn` from within the
 `build/` directory of Spark you¹ll get zinc for free which should

 help

 speed up build times.

 On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:

  Pramod:
 Please remember to run Zinc so that the build is faster.

 Cheers

 On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
 alexander.ula...@hp.com
 wrote:

  Hi Pramod,

 For cluster-like tests you might want to use the same code as in

 mllib's

 LocalClusterSparkContext. You can rebuild only the package that

 you

 change
 and then run this main class.

 Best regards, Alexander

 -Original Message-
 From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
 Sent: Friday, May 01, 2015 1:46 AM
 To: dev@spark.apache.org
 Subject: Speeding up Spark build during development

 Hi,
 I'm making some small changes to the Spark codebase and trying

 it out

 on a
 cluster. I was wondering if there's a faster way to build than

 running

 the
 package target each time.
 Currently I'm using: mvn -DskipTests  package

 All the nodes have the same filesystem mounted at the same mount

 point.

 Pramod

  






 --
 Emre Sevinc







Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-10 Thread Michael Armbrust
-1 (binding)

We just were alerted to a pretty serious regression since 1.3.0 (
https://issues.apache.org/jira/browse/SPARK-6851).  Should have a fix
shortly.

Michael

On Fri, Apr 10, 2015 at 6:10 AM, Corey Nolet cjno...@gmail.com wrote:

 +1 (non-binding)

 - Verified signatures
 - built on Mac OSX
 - built on Fedora 21

 All builds were done using profiles: hive, hive-thriftserver, hadoop-2.4,
 yarn

 +1 tested ML-related items on Mac OS X

 On Wed, Apr 8, 2015 at 7:59 PM, Krishna Sankar ksanka...@gmail.com
 wrote:

  +1 (non-binding, of course)
 
  1. Compiled OSX 10.10 (Yosemite) OK Total time: 14:16 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
  -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
  2. Tested pyspark, mlib - running as well as compare results with 1.3.0
 pyspark works well with the new iPython 3.0.0 release
  2.1. statistics (min,max,mean,Pearson,Spearman) OK
  2.2. Linear/Ridge/Laso Regression OK
  2.3. Decision Tree, Naive Bayes OK
  2.4. KMeans OK
 Center And Scale OK
  2.5. RDD operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
  2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
 Model evaluation/optimization (rank, numIter, lambda) with
 itertools
  OK
  3. Scala - MLlib
  3.1. statistics (min,max,mean,Pearson,Spearman) OK
  3.2. LinearRegressionWithSGD OK
  3.3. Decision Tree OK
  3.4. KMeans OK
  3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
  4.0. Spark SQL from Python OK
  4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA')
 OK
 
  On Tue, Apr 7, 2015 at 10:46 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Please vote on releasing the following candidate as Apache Spark
 version
   1.3.1!
  
   The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
  
  
 

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
  
   The list of fixes present in this release can be found at:
   http://bit.ly/1C2nVPY
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.3.1-rc2/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1083/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
  
   The patches on top of RC1 are:
  
   [SPARK-6737] Fix memory leak in OutputCommitCoordinator
   https://github.com/apache/spark/pull/5397
  
   [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
   https://github.com/apache/spark/pull/5302
  
   [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
   NoClassDefFoundError
   https://github.com/apache/spark/pull/4933
  
   Please vote on releasing this package as Apache Spark 1.3.1!
  
   The vote is open until Saturday, April 11, at 07:00 UTC and passes
   if a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.3.1
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 



Re: [Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Michael Armbrust
Overall this seems like a reasonable proposal to me.  Here are a few
thoughts:

 - There is some debugging utility to the ruleName, so we would probably
want to at least make that an argument to the rule function.
 - We also have had rules that operate on SparkPlan, though since there is
only one ATM maybe we don't need sugar there.
 - I would not call the sugar for creating Strategies rule/seqrule, as I
think the one-to-one vs one-to-many distinction is useful.
 - I'm generally pro-refactoring to make the code nicer, especially when
its not official public API, but I do think its important to maintain
source compatibility (which I think you are) when possible as there are
other projects using catalyst.
 - Finally, we'll have to balance this with other code changes / conflicts.

You should probably open a JIRA and we can continue the discussion there.
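
To make the first point concrete, here is a hedged sketch of the helper with a ruleName threaded through (not a committed API; the RuleSugar object and rule name below are invented):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object RuleSugar {
  // Sketch only: keep a human-readable ruleName for debugging output.
  def rule(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
    new Rule[LogicalPlan] {
      override val ruleName: String = name
      def apply(plan: LogicalPlan): LogicalPlan = plan transform pf
    }
}

// Usage sketch:
//   val MyRewriteRule = RuleSugar.rule("MyRewriteRule") { case p => p }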

On Tue, May 19, 2015 at 4:16 AM, Edoardo Vacchi uncommonnonse...@gmail.com
wrote:

 Hi everybody,

 At the moment, Catalyst rules are defined using two different types of
 rules:
 `Rule[LogicalPlan]` and `Strategy` (which in turn maps to
 `GenericStrategy[SparkPlan]`).

 I propose to introduce utility methods to

   a) reduce the boilerplate to define rewrite rules
   b) turning them back into what they essentially represent: function
 types.

 These changes would be backwards compatible, and would greatly help in
 understanding what the code does. Personally, I feel like the current
 use of objects is redundant and possibly confusing.

 ## `Rule[LogicalPlan]`

 The analyzer and optimizer use `Rule[LogicalPlan]`, which, besides
 defining a default `val ruleName`
 only defines the method `apply(plan: TreeType): TreeType`.
 Because the body of such method is always supposed to read `plan match
 pf`, with `pf`
 being some `PartialFunction[LogicalPlan, LogicalPlan]`, we can
 conclude that `Rule[LogicalPlan]`
 might be substituted by a PartialFunction.

 I propose the following:

 a) Introduce the utility method

 def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]):
 Rule[LogicalPlan] =
   new Rule[LogicalPlan] {
 def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
   }

 b) progressively replace the boilerplate-y object definitions; e.g.

 object MyRewriteRule extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case ... = ...
 }

 with

 // define a Rule[LogicalPlan]
 val MyRewriteRule = rule {
   case ... = ...
 }

 it might also be possible to make rule method `implicit`, thereby
 further reducing MyRewriteRule to:

 // define a PartialFunction[LogicalPlan, LogicalPlan]
 // the implicit would convert it into a Rule[LogicalPlan] at the use
 sites
 val MyRewriteRule = {
   case ... = ...
 }


 ## Strategies

 A similar solution could be applied to shorten the code for
 Strategies, which are total functions
 only because they are all supposed to manage the default case,
 possibly returning `Nil`. In this case
 we might introduce the following utility methods:

 /**
  * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
  * The partial function must therefore return *one single* SparkPlan
 for each case.
  * The method will automatically wrap them in a [[Seq]].
  * Unhandled cases will automatically return Seq.empty
  */
 protected def rule(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
   new Strategy {
 def apply(plan: LogicalPlan): Seq[SparkPlan] =
   if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
   }

 /**
  * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan]
 ].
  * The partial function must therefore return a Seq[SparkPlan] for each
 case.
  * Unhandled cases will automatically return Seq.empty
  */
 protected def seqrule(pf: PartialFunction[LogicalPlan,
 Seq[SparkPlan]]): Strategy =
   new Strategy {
 def apply(plan: LogicalPlan): Seq[SparkPlan] =
   if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
   }

 Thanks in advance
 e.v.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-23 Thread Michael Armbrust
Can you file a JIRA please?

On Tue, Jun 23, 2015 at 1:42 AM, StanZhai m...@zhaishidan.cn wrote:

 Hi all,

 After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered
 the
 following exception when use concat with UDF in where clause:

 ===Exception
 org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to
 dataType on unresolved object, tree:
 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年)
 at

 org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
 at

 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
 at

 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
 at

 scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
 at scala.collection.immutable.List.exists(List.scala:84)
 at

 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
 at

 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 at

 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
 at
 org.apache.spark.sql.catalyst.plans.QueryPlan.org
 $apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
 at

 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to
 (TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at

 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
 at

 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
 at

 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
 at

 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 at

 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to
 (TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at

 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:272)
 at

 

Re: how to implement my own datasource?

2015-06-25 Thread Michael Armbrust
I'd suggest looking at the avro data source as an example implementation:

https://github.com/databricks/spark-avro

I also gave a talk a while ago: https://www.youtube.com/watch?v=GQSNJAzxOr8
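
As a starting point, here is a minimal sketch of a Data Sources API relation (1.3+ interfaces); the package, class names, and the "count" option are all made up for illustration:

package com.example.numbers   // hypothetical package; the source is resolved by this name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Spark looks for a class named DefaultSource in the package passed to format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new NumbersRelation(parameters.getOrElse("count", "10").toInt)(sqlContext)
}

// A relation that produces the integers 1..count as single-column rows.
class NumbersRelation(count: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("n", IntegerType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to count).map(Row(_))
}

// Hypothetical usage from a driver or shell (1.4+ reader API):
//   sqlContext.read.format("com.example.numbers").option("count", "5").load().show()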
Hi,

You can connect to by JDBC as described in
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases.
Other option is using HadoopRDD and NewHadoopRDD to connect to databases
compatible with Hadoop, like HBase, some examples can be found at chapter 5
of Learning Spark
https://books.google.es/books?id=tOptBgAAQBAJpg=PT190dq=learning+spark+hadooprddhl=ensa=Xei=4bqLVaDaLsXaU46NgfgLved=0CCoQ6AEwAA#v=onepageq=%20hadooprddf=false
For Spark Streaming see the section Custom Sources of
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Hope that helps.

Greetings,

Juan

2015-06-25 8:25 GMT+02:00 诺铁 noty...@gmail.com:

 hi,

 I can't find documentation about datasource api,  how to implement custom
 datasource.  any hint is appreciated.thanks.



Re: When to expect UTF8String?

2015-06-12 Thread Michael Armbrust

 1. Custom aggregators that do map-side combine.


This is something I'm hoping to add in Spark 1.5


 2. UDFs with more than 22 arguments which is not supported by ScalaUdf,
 and to avoid wrapping a Java function interface in one of 22 different
 Scala function interfaces depending on the number of parameters.


I'm super open to suggestions here.  Mind possibly opening a JIRA with a
proposed interface?


Re: When to expect UTF8String?

2015-06-11 Thread Michael Armbrust
Through the DataFrame API, users should never see UTF8String.

Expression (and any class in the catalyst package) is considered internal
and so uses the internal representation of various types.  Which type we
use here is not stable across releases.

Is there a reason you aren't defining a UDF instead?
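
For example, a rough sketch of the UDF route, where the function only ever sees plain Strings (the UDF name, table, and column are made up, and a sqlContext is assumed):

import org.apache.spark.sql.functions.{col, udf}

// SQL-facing registration: the lambda receives ordinary String values, never UTF8String.
sqlContext.udf.register("shout", (s: String) => s.toUpperCase)
sqlContext.sql("SELECT shout(name) FROM people").show()

// Or as a column expression on a DataFrame:
val shout = udf((s: String) => s.toUpperCase)
sqlContext.table("people").select(shout(col("name"))).show()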

On Thu, Jun 11, 2015 at 8:08 PM, zsampson zsamp...@palantir.com wrote:

 I'm hoping for some clarity about when to expect String vs UTF8String when
 using the Java DataFrames API.

 In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was
 once a String is now a UTF8String. The comments in the file and the related
 commit message indicate that maybe it should be internal to SparkSQL's
 implementation.

 However, when I add a column containing a custom subclass of Expression,
 the
 row passed to the eval method contains instances of UTF8String. Ditto for
 AggregateFunction.update. Is this expected? If so, when should I generally
 know to deal with UTF8String objects?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/When-to-expect-UTF8String-tp12710.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Michael Armbrust
It's no longer valid to start more than one instance of HiveContext in a
single JVM, as one of the goals of this refactoring was to allow connection
to more than one metastore from a single context.

For tests I suggest you use TestHive as we do in our unit tests.  It has a
reset() method you can use to cleanup state between tests/suites.

We could also add an explicit close() method to remove this restriction,
but if that's something you want to investigate we should move this off the
vote thread and onto JIRA.
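
A rough sketch of that pattern inside a test body (this assumes the spark-hive test artifact is on the classpath; the table name is made up):

import org.apache.spark.sql.hive.test.TestHive

// Exercise the code under test against the shared TestHive context...
TestHive.sql("CREATE TABLE IF NOT EXISTS tmp_kv (key INT, value STRING)")
TestHive.sql("SELECT COUNT(*) FROM tmp_kv").collect()

// ...then clean up so the next suite starts from a known state.
TestHive.reset()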

On Tue, Jun 2, 2015 at 7:19 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Thanks Yin, tried on a clean VM - works now. But tests in my app still
 fails:

 [info]   Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test 
 connection to the given database. JDBC url = 
 jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
 Terminating connection pool (set lazyInit to true if you expect to start your 
 database after your app). Original Exception: --
 [info] java.sql.SQLException: Failed to start database 'metastore_db' with 
 class loader 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$anon$1@380628de, see 
 the next exception for details.
 [info] at 
 org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
 [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.EmbedConnection.init(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.EmbedConnection40.init(Unknown 
 Source)
 [info] at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown 
 Source)
 [info] at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
 [info] at org.apache.derby.jdbc.Driver20.connect(Unknown Source)
 [info] at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
 [info] at java.sql.DriverManager.getConnection(DriverManager.java:571)
 [info] at java.sql.DriverManager.getConnection(DriverManager.java:187)
 [info] at 
 com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
 [info] at com.jolbox.bonecp.BoneCP.init(BoneCP.java:416)
 [info] at 
 com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)

 I’ve set

 parallelExecution in Test := false,

 Thanks,
 Peter Rudenko

 On 2015-06-01 21:10, Yin Huai wrote:

   Hi Peter,

  Based on your error message, seems you were not using the RC3. For the
 error thrown at HiveContext's line 206, we have changed the message to this
 one
 https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207
  just
 before RC3. Basically, we will not print out the class loader name. Can you
 check if a older version of 1.4 branch got used? Have you published a RC3
 to your local maven repo? Can you clean your local repo cache and try again?

  Thanks,

  Yin

 On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko  petro.rude...@gmail.com
 petro.rude...@gmail.com wrote:

  Still have problem using HiveContext from sbt. Here’s an example of
 dependencies:

  val sparkVersion = "1.4.0-rc3"

  lazy val root = Project(id = "spark-hive", base = file("."),
    settings = Project.defaultSettings ++ Seq(
      name := "spark-1.4-hive",
      scalaVersion := "2.10.5",
      scalaBinaryVersion := "2.10",
      resolvers += "Spark RC" at "https://repository.apache.org/content/repositories/orgapachespark-1110/",
      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-mllib" % sparkVersion,
        "org.apache.spark" %% "spark-hive" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion
      )
    ))

 Launching sbt console with it and running:

 val conf = new SparkConf().setMaster("local[4]").setAppName("test")
 val sc = new SparkContext(conf)
 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 val data = sc.parallelize(1 to 1)
 import sqlContext.implicits._
 scala> data.toDF
 java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
 metastore using classloader 
 scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set 
 spark.sql.hive.metastore.jars
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
 at 
 org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
 at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
 at 
 

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Michael Armbrust
I think this is likely something that we'll want to do during the code
generation phase.  Though its probably not the lowest hanging fruit at this
point.

On Sun, May 31, 2015 at 5:02 AM, Reynold Xin r...@databricks.com wrote:

 I think you are looking for
 http://en.wikipedia.org/wiki/Common_subexpression_elimination in the
 optimizer.

 One thing to note is that as we do more and more optimization like this,
 the optimization time might increase. Do you see a case where this can
 bring you substantial performance gains?


 On Sat, May 30, 2015 at 9:02 AM, Justin Uang justin.u...@gmail.com
 wrote:

 On second thought, perhaps can this be done by writing a rule that builds
 the dag of dependencies between expressions, then convert it into several
 layers of projections, where each new layer is allowed to depend on
 expression results from previous projections?

 Are there any pitfalls to this approach?

 On Sat, May 30, 2015 at 11:30 AM Justin Uang justin.u...@gmail.com
 wrote:

 If I do the following

 df2 = df.withColumn('y', df['x'] * 7)
 df3 = df2.withColumn('z', df2.y * 3)
 df3.explain()

 Then the result is

  Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS
 y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
   PhysicalRDD [date#56,id#57,timestamp#58,x#59],
 MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163

 Effectively I want to compute

 y = f(x)
 z = g(y)

 The catalyst optimizer realizes that y#64 is the same as the one
 previously computed, however, when building the projection, it is ignoring
 the fact that it had already computed y, so it calculates `x * 7` twice.

 y = x * 7
 z = x * 7 * 3

 If I wanted to make this fix, would it be possible to do the logic in
 the optimizer phase? I imagine that it's difficult because the expressions
 in InterpretedMutableProjection don't have access to the previous
 expression results, only the input row, and that the design doesn't seem to
 be catered for this.





Re: [SparkSQL 1.4.0]The result of SUM(xxx) in SparkSQL is 0.0 but not null when the column xxx is all null

2015-07-06 Thread Michael Armbrust
This was a change that was made to match a wrong answer coming from older
versions of Hive.  Unfortunately I think it's too late to fix this in the
1.4 branch (as I'd like to avoid changing answers at all in point
releases), but in Spark 1.5 we revert to the correct behavior.

https://issues.apache.org/jira/browse/SPARK-8828

On Thu, Jul 2, 2015 at 11:58 PM, StanZhai m...@zhaishidan.cn wrote:

 Hi all,

 I have a table named test like this:

 |  a  |  b  |
 |  1  | null |
 |  2  | null |

 After upgraded the cluster from spark 1.3.1 to 1.4.0, I found the Sum
 function in spark 1.4 and 1.3 are different.

 The SQL is: select sum(b) from test

 In Spark 1.4.0 the result is 0.0, in spark 1.3.1 the result is null. I
 think
 the result should be null, why the result is 0.0 in 1.4.0 but not null? Is
 this a bug?

 Any hint is appreciated.



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-1-4-0-The-result-of-SUM-xxx-in-SparkSQL-is-0-0-but-not-null-when-the-column-xxx-is-all-null-tp13008.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



