Re: test cases stuck on local-cluster mode of ReplSuite?

2014-03-14 Thread Michael Armbrust
Sorry to revive an old thread, but I just ran into this issue myself. It is likely that you do not have the assembly jar built, or that you have SPARK_HOME set incorrectly (it does not need to be set). Michael On Thu, Feb 27, 2014 at 8:13 AM, Nan Zhu wrote: Hi, all

Re: new Catalyst/SQL component merged into master

2014-03-20 Thread Michael Armbrust
Hi Everyone, I'm very excited about merging this new feature into Spark! We have a lot of cool things in the pipeline, including: porting Shark's in-memory columnar format to Spark SQL, code-generation for expression evaluation and improved support for complex types in parquet. I would love to

Re: new Catalyst/SQL component merged into master

2014-03-21 Thread Michael Armbrust
It will be great if there are any examples or usecases to look at ? There are examples in the Spark documentation. Patrick posted an updated copy here so people can see them before 1.0 is released: Does this feature

Making RDDs Covariant

2014-03-21 Thread Michael Armbrust
Hey Everyone, Here is a pretty major (but source compatible) change we are considering making to the RDD API for 1.0. Java and Python APIs would remain the same, but users of Scala would likely need to use fewer casts. This would be especially true for libraries whose functions take RDDs as

Re: Making RDDs Covariant

2014-03-22 Thread Michael Armbrust
From my experience, covariance often becomes a pain when dealing with serialization/deserialization (I've experienced a few cases while developing play-json datomisca). Moreover, if you have implicits, variance often becomes a headache... This is exactly the kind of feedback I was hoping

Re: Making RDDs Covariant

2014-03-22 Thread Michael Armbrust
Hi Pascal, Thanks for the input. I think we are going to be okay here since, as Koert said, the current serializers use runtime type information. We could also keep a ClassTag around for the original type when the RDD was created. Good things to be aware of though. Michael On Sat, Mar 22,

Re: new Catalyst/SQL component merged into master

2014-03-24 Thread Michael Armbrust
, so we're really curious as far as the architectural direction. -Evan On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust wrote: It will be great if there are any examples or usecases to look at ? There are examples in the Spark documentation. Patrick posted

Travis CI

2014-03-25 Thread Michael Armbrust
Just a quick note to everyone that Patrick and I are playing around with Travis CI on the Spark github repository. For now, Travis does not run all of the test cases, so it will only be turned on experimentally. Long term it looks like Travis might give better integration with github, so we are

Re: Travis CI

2014-03-29 Thread Michael Armbrust
Is the migration from Jenkins to Travis finished? It is not finished and really at this point it is only something we are considering, not something that will happen for sure. We turned it on in addition to Jenkins so that we could start finding issues exactly like the ones you described

Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
There is a JIRA for one of the flakey tests here: On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote: TD - do you know what is going on here? I looked into this a bit and at least a few of these that use Thread.sleep()

Re: RFC: varargs in Logging.scala?

2014-04-10 Thread Michael Armbrust
Hi Marcelo, Thanks for bringing this up here, as this has been a topic of debate recently. Some thoughts below. ... all of them suffer from the fact that the log message needs to be built even though it might not be used. This is not true of the current implementation (and this is actually

Re: RFC: varargs in Logging.scala?

2014-04-10 Thread Michael Armbrust
BTW... You can do calculations in string interpolation: s"Time: ${timeMillis / 1000}s" Or use format strings. f"Float with two decimal places: $floatValue%.2f" More info: On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust mich
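A runnable sketch of the two interpolators mentioned above (plain Scala, no Spark involved; `timeMillis` and `floatValue` are illustrative values):

```scala
// s-interpolator: evaluates arbitrary expressions inline.
val timeMillis = 5000L
val elapsed = s"Time: ${timeMillis / 1000}s"

// f-interpolator: adds printf-style format specifiers after the expression.
val floatValue = 3.14159f
val rounded = f"Float with two decimal places: $floatValue%.2f"

println(elapsed) // prints: Time: 5s
println(rounded) // prints: Float with two decimal places: 3.14
```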

Re: Problem creating objects through reflection

2014-04-24 Thread Michael Armbrust
The Spark REPL is slightly modified from the normal Scala REPL to prevent work from being done twice when closures are deserialized on the workers. I'm not sure exactly why this causes your problem, but it's probably worth filing a JIRA about it. Here is another issue with classes defined in the

Re: [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-16 Thread Michael Armbrust
-1 We found a regression in the way configuration is passed to executors. Michael On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.comwrote: +1 On Fri, May 16, 2014 at 2:16 AM,

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Thanks for reporting this! If you could test out that PR and see if it fixes your problems I'd really appreciate it! Michael On Thu, May 29, 2014 at 9:09 AM, Andrew Ash wrote: I

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Yes, you'll need to download the code from that PR and reassemble Spark (sbt/sbt assembly). On Thu, May 29, 2014 at 10:02 AM, dataginjaninja wrote: Michael, Will I have to rebuild after adding the change? Thanks -- View this message in context:

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
You should be able to get away with only doing it locally. This bug is happening during analysis which only occurs on the driver. On Thu, May 29, 2014 at 10:17 AM, dataginjaninja wrote: Darn, I was hoping just to sneak it in that file. I am not the only person

Re: Timestamp support in v1.0

2014-06-05 Thread Michael Armbrust
Awesome, thanks for testing! On Thu, Jun 5, 2014 at 1:30 PM, dataginjaninja wrote: I can confirm that the patch fixed my issue. :-) - Cheers, Stephanie -- View this message in context:

Re: question about Hive compatiblilty tests

2014-06-18 Thread Michael Armbrust
I assume you are adding tests, because that is the only time you should see that message. That error could mean a couple of things: 1) The query is invalid and hive threw an exception 2) Your Hive setup is bad. Regarding #2, you need to have the source for Hive 0.12.0 available and built as

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-05 Thread Michael Armbrust
+1 I tested sql/hive functionality. On Sat, Jul 5, 2014 at 9:30 AM, Mark Hamstra wrote: +1 On Fri, Jul 4, 2014 at 12:40 PM, Patrick Wendell wrote: I'll start the voting with a +1 - ran tests on the release candidate and ran some basic

Re: sparkSQL thread safe?

2014-07-10 Thread Michael Armbrust
Hey Ian, Thanks for bringing these up! Responses in-line: Just wondering if right now spark sql is expected to be thread safe on master? doing a simple hadoop file -> RDD -> schema RDD -> write parquet will fail in reflection code if I run these in a thread pool. You are probably hitting

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693

Re: SQLQuerySuite error

2014-07-24 Thread Michael Armbrust
Thanks for reporting back. I was pretty confused trying to reproduce the error :) On Thu, Jul 24, 2014 at 1:09 PM, Stephen Boesch wrote: OK I did find my error. The missing step: mvn install I should have republished (mvn install) all of the other modules. The mvn

Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Michael Armbrust
That query is looking at Fix Version not Target Version. The fact that the first one is still open is only because the bug is not resolved in master. It is fixed in 1.0.2. The second one is partially fixed in 1.0.2, but is not worth blocking the release for. On Fri, Jul 25, 2014 at 4:23 PM,

Re: new JDBC server test cases seems failed ?

2014-07-27 Thread Michael Armbrust
How recent is this? We've already reverted this patch once due to failing tests. It would be helpful to include a link to the failed build. If it's failing again we'll have to revert again. On Sun, Jul 27, 2014 at 5:26 PM, Nan Zhu wrote: Hi, all It seems that the JDBC

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Michael Armbrust
A few things: - When we upgrade to Hive 0.13.0, Patrick will likely republish the hive-exec jar just as we did for 0.12.0 - Since we have to tie into some pretty low level APIs it is unsurprising that the code doesn't just compile out of the box against 0.13.0 - ScalaReflection is for

Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Michael Armbrust
It seems that the HiveCompatibilitySuite need a hadoop and hive environment, am I right? Relative path in absolute URI: file:${system:test.tmp.dir}/tmp_showcrt1 You should only need Hadoop and Hive if you are creating new tests that we need to compute the answers for. Existing tests

Re: Working Formula for Hive 0.13?

2014-08-08 Thread Michael Armbrust
Could you make a PR as described here: On Fri, Aug 8, 2014 at 1:57 PM, Zhan Zhang wrote: Sorry, forget to upload files. I have never posted before :) hive.diff

Re: Working Formula for Hive 0.13?

2014-08-25 Thread Michael Armbrust
whether it's ok to make a PR now because hive-0.13 version is not compatible with hive-0.12 and here i used org.apache.hive. On 2014/7/29 8:22, Michael Armbrust wrote: A few things: - When we upgrade to Hive 0.13.0, Patrick will likely republish the hive-exec jar just as we did for 0.12.0

Re: [Spark SQL] off-heap columnar store

2014-08-25 Thread Michael Armbrust
What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1 is it? It is not in 1.1 and there are no concrete plans for adding it at this point. Currently, there is more engineering investment going into caching parquet data in Tachyon instead.

Re: Storage Handlers in Spark SQL

2014-08-25 Thread Michael Armbrust
- dev list + user list You should be able to query Spark SQL using JDBC, starting with the 1.1 release. There is some documentation in the repo, and we'll update the official docs once the

Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Michael Armbrust
Any initial proposal or design about the caching to Tachyon that you can share so far? Caching parquet files in tachyon with saveAsParquetFile and then reading them with parquetFile should already work. You can use SQL on these tables by using registerTempTable. Some of the general parquet
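A minimal sketch of the flow described above, against the Spark 1.x SchemaRDD API; the tachyon:// URI and table name are made up for illustration, and an existing SQLContext `sqlContext` and SchemaRDD `events` are assumed:

```scala
// Write the data as Parquet into Tachyon-backed storage (path is hypothetical).
events.saveAsParquetFile("tachyon://master:19998/warehouse/events.parquet")

// Read it back and expose it to SQL via a temporary table.
val cached = sqlContext.parquetFile("tachyon://master:19998/warehouse/events.parquet")
cached.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events")
```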

Re: CoHadoop Papers

2014-08-26 Thread Michael Armbrust
It seems like there are two things here: - Co-locating blocks with the same keys to avoid network transfer. - Leveraging partitioning information to avoid a shuffle when data is already partitioned correctly (even if those partitions aren't yet on the same machine). The former seems more

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Michael Armbrust
+1 On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia wrote: +1 Tested on Mac OS X. Matei On September 2, 2014 at 5:03:19 PM, Kan Zhang ( wrote: +1 Verified PySpark InputFormat/OutputFormat examples. On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Michael Armbrust
+1 On Wed, Sep 3, 2014 at 12:29 AM, Reynold Xin wrote: +1 Tested locally on Mac OS X with local-cluster mode. On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell wrote: I'll kick it off with a +1 On Wed, Sep 3, 2014 at 12:24 AM, Patrick

Re: trimming unnecessary test output

2014-09-07 Thread Michael Armbrust
Feel free to submit a PR to add a log4j.properties file to sql/catalyst/src/test/resources similar to what we do in core/hive. On Sat, Sep 6, 2014 at 2:50 PM, Sean Owen wrote: This is just a line logging that one test succeeded right? I don't find that noise. Recently I

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger wrote: Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger wrote: Opened I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust wrote: On Tue, Sep

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust wrote: Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger wrote

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
/d1 is a directory, not a parquet partition sqlContext.parquetFile("/foo") // works, but has the noted lack of pushdown sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2")) Is there another alternative? On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust mich

Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Michael Armbrust
) On Tue, Sep 9, 2014 at 3:02 PM, Michael Armbrust wrote: What Patrick said is correct. Two other points: - In the 1.2 release we are hoping to beef up the support for working with partitioned parquet independent of the metastore. - You can actually do operations

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Michael Armbrust
chance of adding it to the 1.1.1 point release, assuming there ends up being one? On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust wrote: Hey Cody, Thanks for doing this! Will look at your PR later today. Michael On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger c

Re: problem with HiveContext inside Actor

2014-09-17 Thread Michael Armbrust
- dev Is it possible that you are constructing more than one HiveContext in a single JVM? Due to global state in Hive code this is not allowed. Michael On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao wrote: Hi, Du I am not sure what you mean “triggers the HiveContext to

Re: Support for Hive buckets

2014-09-22 Thread Michael Armbrust
Hi Cody, There are currently no concrete plans for adding buckets to Spark SQL, but that's mostly due to lack of resources / demand for this feature. Adding full support is probably a fair amount of work since you'd have to make changes throughout parsing/optimization/execution. That said, there

Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Michael Armbrust
I actually submitted a patch to do this yesterday: Can you tell us more about your configuration. In particular how much memory/cores do the executors have and what does the schema of your data look like? On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger

Re: view not supported in spark thrift server?

2014-09-28 Thread Michael Armbrust
Views are not supported yet. It's not currently on the near term roadmap, but that can change if there is sufficient demand or someone in the community is interested in implementing them. I do not think it would be very hard. Michael On Sun, Sep 28, 2014 at 11:59 AM, Du Li

Re: Extending Scala style checks

2014-10-01 Thread Michael Armbrust
The hard part here is updating the existing code base... which is going to create merge conflicts with like all of the open PRs... On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas wrote: Ah, since there appears to be a built-in rule for end-of-line whitespace,

Re: Parquet schema migrations

2014-10-05 Thread Michael Armbrust
Hi Cody, Assuming you are talking about 'safe' changes to the schema (i.e. existing column names are never reused with incompatible types), this is something I'd love to support. Perhaps you can describe more what sorts of changes you are making, and if simple merging of the schemas would be

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Michael Armbrust
Thanks for the input. We purposefully made sure that the config option did not make it into a release as it is not something that we are willing to support long term. That said we'll try and make this easier in the future either through hints or better support for statistics. In this particular

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread Michael Armbrust
Yes, the foreign sources work is only about exposing a stable set of APIs for external libraries to link against (to avoid the spark assembly becoming a dependency mess). The code path these APIs use will be the same as that for datasources included in the core spark sql library. Michael On

Re: Trouble running tests

2014-10-09 Thread Michael Armbrust
Also, in general for SQL only changes it is sufficient to run sbt/sbt catalyst/test sql/test hive/test. The hive/test part takes the longest, so I usually leave that out until just before submitting unless my changes are hive specific. On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas

Re: Parquet Migrations

2014-10-31 Thread Michael Armbrust
You can't change parquet schema without reencoding the data as you need to recalculate the footer index data. You can manually do what SPARK-3851 is going to do today however. Consider two schemas: Old Schema: (a: Int, b: String) New Schema,
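The manual workaround amounts to padding the old data out to the new schema before unioning; a hedged sketch, assuming an existing sqlContext (paths and table names are hypothetical):

```scala
// Read both generations of the data separately.
val oldData = sqlContext.parquetFile("/data/old")
val newData = sqlContext.parquetFile("/data/new")
oldData.registerTempTable("old_data")
newData.registerTempTable("new_data")

// Pad the old schema (a: Int, b: String) with a null column c so it
// lines up with the new schema (a: Int, b: String, c: Int nullable).
val merged = sqlContext.sql(
  """SELECT a, b, CAST(null AS INT) AS c FROM old_data
    |UNION ALL
    |SELECT a, b, c FROM new_data""".stripMargin)
```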

Re: Surprising Spark SQL benchmark

2014-11-04 Thread Michael Armbrust
dev to bcc. Thanks for reaching out, Ozgun. Let's discuss if there were any missing optimizations off list. We'll make sure to report back or add any findings to the tuning guide. On Mon, Nov 3, 2014 at 3:01 PM, ozgun wrote: Hey Patrick, It's Ozgun from Citus Data. We'd

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Michael Armbrust
+1 (binding) On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC

Re: Replacing Spark's native scheduler with Sparrow

2014-11-08 Thread Michael Armbrust
However, I haven't seen it be as high as the 100ms Michael quoted (maybe this was for jobs with tasks that have much larger objects that take a long time to deserialize?). I was thinking more about the average end-to-end latency for launching a query that has 100s of partitions. Its also

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Michael Armbrust
Hey Sean, Thanks for pointing this out. Looks like a bad test where we should be doing Set comparison instead of Array. Michael On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen wrote: LICENSE and NOTICE are fine. Signature and checksum is fine. I unzipped and built the plain

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Michael Armbrust
I'm going to have to disagree here. If you are building a release distribution or integrating with legacy systems then maven is probably the correct choice. However most of the core developers that I know use sbt, and I think it's a better choice for exploration and development overall. That

Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Michael Armbrust
* I moved from sbt to maven in June specifically due to Andrew Or's describing mvn as the default build tool. Developers should keep in mind that jenkins uses mvn so we need to run mvn before submitting PR's - even if sbt were used for day to day dev work To be clear, I think that the PR

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Michael Armbrust
, Nov 29, 2014 at 12:57 AM, Michael Armbrust wrote: You probably don't need to create a new kind of SchemaRDD. Instead I'd suggest taking a look at the data sources API that we are adding in Spark 1.2. There is not a ton of documentation, but the test cases show how

Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe

2014-12-02 Thread Michael Armbrust
In Hive 13 (which is the default for Spark 1.2), parquet is included and thus we no longer include the Hive parquet bundle. You can now use the included ParquetSerDe: If you want to compile Spark 1.2 with Hive 12 instead you can pass

Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe

2014-12-04 Thread Michael Armbrust
Here's a fix: On Wed, Dec 3, 2014 at 11:05 AM, Michael Armbrust wrote: Thanks for reporting. As a workaround you should be able to SET spark.sql.hive.convertMetastoreParquet=false, but I'm going to try to fix this before

Re: drop table if exists throws exception

2014-12-05 Thread Michael Armbrust
The command runs fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang wrote: Hi, I got exception saying Hive: NoSuchObjectException(message:table

Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-05 Thread Michael Armbrust
Thanks for reporting. This looks like a regression related to: I've filed it here: On Fri, Dec 5, 2014 at 12:03 PM, kb wrote: I am having trouble getting create table as select or

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Michael Armbrust
This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition. On Sat, Dec 6, 2014 at 8:28 PM, Jianshi

Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-08 Thread Michael Armbrust
Message- From: Michael Armbrust [] Sent: Saturday, December 6, 2014 4:51 AM To: kb Cc:; Cheng Hao Subject: Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0 Thanks for reporting. This looks like a regression

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-09 Thread Michael Armbrust
is to use a subquery to add a bunch of column alias. I'll try it later. Thanks, Jianshi On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust wrote: This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata

Re: SparkSQL not honoring schema

2014-12-10 Thread Michael Armbrust
As the scala doc for applySchema says, It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exceptions. We don't check as doing runtime reflection on all of the data would be very expensive. You will
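To make the contract concrete, a sketch against the 1.x applySchema API (assumes an existing sc and sqlContext; the import path for the type classes moved between releases, so treat it as approximate):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))

// Correct: every Row matches the schema positionally and by type.
val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", null)))
val people = sqlContext.applySchema(rows, schema)

// Incorrect: Row("alice", "30") compiles fine but fails only at runtime,
// because "30" is a String where the schema declares an Int.
```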

Re: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Michael Armbrust
[] Sent: Friday, December 12, 2014 6:37 AM To: Michael Armbrust; Subject: Where are the docs for the SparkSQL DataTypes? Michael other Spark SQL junkies, As I read through the Spark API docs, in particular those for the org.apache.spark.sql

Re: Is there any document to explain how to build the hive jars for spark?

2014-12-14 Thread Michael Armbrust
The modified version of hive can be found here: On Thu, Dec 11, 2014 at 5:47 PM, Yi Tian wrote: Hi, all We found some bugs in hive-0.12, but we could not wait for hive community fixing them. We want to fix these bugs in our lab and

Re: Data source interface for making multiple tables available for query

2014-12-22 Thread Michael Armbrust
I agree and this is something that we have discussed in the past. Essentially I think instead of creating a RelationProvider that returns a single table, we'll have something like an external catalog that can return multiple base relations. On Sun, Dec 21, 2014 at 6:43 PM, Venkata ramana

Re: Unsupported Catalyst types in Parquet

2014-12-29 Thread Michael Armbrust
for timestamp type support. For decimal type, I think we only support decimals that fits in a long. Thanks, Daoyuan -Original Message- From: Alessandro Baretta [] Sent: Saturday, December 27, 2014 2:47 PM To:; Michael Armbrust Subject

Re: query planner design doc?

2015-01-23 Thread Michael Armbrust
of strategies are basically embodied in there a design doc/roadmap/JIRA issue detailing what strategies exist and which are planned? Thanks, Nick On Jan 22, 2015, at 7:45 PM, Michael Armbrust wrote: Here is the initial design document

Re: Are there any plans to run Spark on top of Succinct

2015-01-26 Thread Michael Armbrust
There was work being done at Berkeley on prototyping support for Succinct in Spark SQL. Rachit might have more information. On Thu, Jan 22, 2015 at 7:04 AM, Dean Wampler wrote: Interesting. I was wondering recently if anyone has explored working with compressed data

Re: Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Michael Armbrust
+1 to adding such an optimization to parquet. The bytes are tagged specially as UTF8 in the parquet schema so it seems like it would be possible to add this. On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies wrote: Hi, It seems that a reasonably large proportion of

Re: query planner design doc?

2015-01-22 Thread Michael Armbrust
Here is the initial design document for catalyst: Strategies (many of which are in SparkStrategies.scala) are the part that creates the physical operators from a catalyst logical plan. These operators have

Re: HiveContext cannot be serialized

2015-02-16 Thread Michael Armbrust
I'd suggest marking the HiveContext as @transient since it's not valid to use it on the slaves anyway. On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang wrote: When I'm investigating this issue (in the end of this email), I take a look at HiveContext's code and find this change
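A sketch of the suggested pattern (the class and query are illustrative): keep the HiveContext @transient and lazy so it is neither serialized into closures nor used on the slaves.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

class QueryRunner(@transient val sc: SparkContext) extends Serializable {
  // Driver-side only: @transient keeps the context out of any
  // serialized closures; lazy defers construction until first use.
  @transient lazy val hiveContext = new HiveContext(sc)

  def topRows(): Array[org.apache.spark.sql.Row] =
    hiveContext.sql("SELECT * FROM src LIMIT 10").collect()
}
```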

Re: HiveContext cannot be serialized

2015-02-16 Thread Michael Armbrust
Michael - it is already transient. This should probably be considered a bug in the scala compiler, but we can easily work around it by removing the use of destructuring binding. On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust wrote: I'd suggest

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Michael Armbrust
P.S: For some reason replacing import sqlContext.createSchemaRDD with import sqlContext.implicits._ doesn't do the implicit conversions. registerTempTable gives syntax error. I will dig deeper tomorrow. Has anyone seen this ? We will write up a whole migration guide before the final

Re: Hive SKEWED feature supported in Spark SQL ?

2015-02-19 Thread Michael Armbrust
1) is SKEWED BY honored ? If so, has anyone run into directories not being created ? It is not. 2) if it is not honored, does it matter ? Hive introduced this feature to better handle joins where tables had a skewed distribution on keys joined on so that the single mapper handling one of

Re: renaming SchemaRDD - DataFrame

2015-01-28 Thread Michael Armbrust
In particular the performance tricks are in SpecificMutableRow. On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan wrote: Yeah, it's null. I was worried you couldn't represent it in Row because of primitive types like Int (unless you box the Int, which would be a performance

Re: Caching tables at column level

2015-02-01 Thread Michael Armbrust
It's not completely transparent, but you can do something like the following today: CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies wrote: I have been working a lot recently with denormalised tables

GitHub Syncing Down

2015-03-10 Thread Michael Armbrust

Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Michael Armbrust
Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filed a JIRA and will investigate if this is something we can fix. On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe

Re: Any guidance on when to back port and how far?

2015-03-24 Thread Michael Armbrust
Two other criteria that I use when deciding what to backport: - Is it a regression from a previous minor release? I'm much more likely to backport fixes in this case, as I'd love for most people to stay up to date. - How scary is the change? I think the primary goal is stability of the

Re: enum-like types in Spark

2015-03-04 Thread Michael Armbrust
#4 with a preference for CamelCaseEnums On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley wrote: another vote for #4 People are already used to adding () in Java. On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch wrote: #4 but with MemoryOnly (more

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Michael Armbrust
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra wrote: So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the Building Spark page isn't looking too

Re: [SQL][Feature] Access row by column name instead of index

2015-04-24 Thread Michael Armbrust
Already done :) On Fri, Apr 24, 2015 at 2:37 PM, Reynold Xin wrote: Can you elaborate what you mean by that? (what's already available in Python?) On Fri, Apr 24, 2015 at 2:24 PM, Shuai

Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not threadsafe. I would recommend using HiveQL. On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wrote: actually this is a sql parse exception, are you sure your sql is right? 发自我的 iPhone 在 2015年4月30日,18:50,Haopu Wang

Re: Uninitialized session in HiveContext?

2015-04-30 Thread Michael Armbrust
Hey Marcelo, Thanks for the heads up! I'm currently in the process of refactoring all of this (to separate the metadata connection from the execution side) and as part of this I'm making the initialization of the session not lazy. It would be great to hear if this also works for your internal

Re: Plans for upgrading Hive dependency?

2015-04-29 Thread Michael Armbrust
I am working on it. Here is the (very rough) version: On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal wrote: Thanks Marcelo and Patrick - I don't know how I missed that ticket in my

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Michael Armbrust
I'd happily merge a PR that changes the distinct implementation to be more like Spark core, assuming it includes benchmarks that show better performance for both the fits in memory case and the too big for memory case. On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot

Re: Speeding up Spark build during development

2015-05-04 Thread Michael Armbrust
FWIW... My Spark SQL development workflow is usually to run build/sbt sparkShell or build/sbt 'sql/test-only testSuiteName'. These commands start in as little as 30s on my laptop, automatically figure out which subprojects need to be rebuilt, and don't require the expensive assembly creation.

Re: [SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Michael Armbrust
What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can reproduce the same bug in master, can you open a JIRA? On Fri, May 8, 2015 at 1:36 AM, Haopu Wang

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-10 Thread Michael Armbrust
-1 (binding) We just were alerted to a pretty serious regression since 1.3.0 ( Should have a fix shortly. Michael On Fri, Apr 10, 2015 at 6:10 AM, Corey Nolet wrote: +1 (non-binding) - Verified signatures - built on Mac

Re: [Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Michael Armbrust
Overall this seems like a reasonable proposal to me. Here are a few thoughts: - There is some debugging utility to the ruleName, so we would probably want to at least make that an argument to the rule function. - We also have had rules that operate on SparkPlan, though since there is only one

Re: [SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-23 Thread Michael Armbrust
Can you file a JIRA please? On Tue, Jun 23, 2015 at 1:42 AM, StanZhai wrote: Hi all, After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered the following exception when use concat with UDF in where clause:

Re: how to implement my own datasource?

2015-06-25 Thread Michael Armbrust
I'd suggest looking at the avro data source as an example implementation: I also gave a talk a while ago: Hi, You can connect to by JDBC as described in
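For orientation, a toy relation against the Spark 1.x data sources API (org.apache.spark.sql.sources); the class names and the "n" option are made up for this sketch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point Spark looks up by package name; returns our relation.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    RangeRelation(parameters("n").toInt)(sqlContext)
}

// A relation that produces the numbers 1..n as single-column rows.
case class RangeRelation(n: Int)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(StructField("id", IntegerType, nullable = false) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to n).map(Row(_))
}
```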

Re: When to expect UTF8String?

2015-06-12 Thread Michael Armbrust
1. Custom aggregators that do map-side combine. This is something I'm hoping to add in Spark 1.5 2. UDFs with more than 22 arguments which is not supported by ScalaUdf, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number

Re: When to expect UTF8String?

2015-06-11 Thread Michael Armbrust
Through the DataFrame API, users should never see UTF8String. Expression (and any class in the catalyst package) is considered internal and so uses the internal representation of various types. Which type we use here is not stable across releases. Is there a reason you aren't defining a UDF
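Sticking to UDFs keeps user code at the DataFrame layer, where strings arrive as plain java.lang.String rather than the internal UTF8String; a sketch against the 1.3+ API (df, the column name, and the function are illustrative):

```scala
import org.apache.spark.sql.functions.udf

// The function body sees a plain java.lang.String; Spark converts
// to and from its internal representation at the boundary.
val shout = udf((s: String) => s.toUpperCase)
df.select(shout(df("name")).as("name_upper"))

// Or register it for use from SQL:
sqlContext.udf.register("shout", (s: String) => s.toUpperCase)
sqlContext.sql("SELECT shout(name) FROM people")
```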
