[jira] [Assigned] (SPARK-17643) Remove comparable requirement from Offset

2016-09-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-17643: Assignee: Michael Armbrust > Remove comparable requirement from Offset

[jira] [Updated] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17627: - Component/s: SQL > Streaming Providers should be labeled Experimental

[jira] [Created] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-17627: Summary: Streaming Providers should be labeled Experimental Key: SPARK-17627 URL: https://issues.apache.org/jira/browse/SPARK-17627 Project: Spark

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510787#comment-15510787 ] Michael Armbrust commented on SPARK-16407: -- I'm still a little uncle

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504301#comment-15504301 ] Michael Armbrust commented on SPARK-16407: -- You are taking an experime

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504303#comment-15504303 ] Michael Armbrust commented on SPARK-16407: -- You are taking an experime

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-15 Thread Michael Armbrust
>> Hi everyone, >> I'm currently trying to create a generic transformation mechanism on a >> Dataframe to modify an arbitrary column regardless of the underlying >> schema. >> >> It's "relatively" straightforward for complex types like

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494475#comment-15494475 ] Michael Armbrust commented on SPARK-16407: -- Sure, but the bar for compatibi

[jira] [Comment Edited] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494475#comment-15494475 ] Michael Armbrust edited comment on SPARK-16407 at 9/15/16 8:4

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491819#comment-15491819 ] Michael Armbrust commented on SPARK-16407: -- I think it is likely that we

[jira] [Updated] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17445: - Target Version/s: 2.0.1, 2.1.0 > Reference an ASF page as the main place to find third-party packages

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488496#comment-15488496 ] Michael Armbrust commented on SPARK-15406: -- For the types that are coming

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488477#comment-15488477 ] Michael Armbrust commented on SPARK-15406: -- Streaming is labeled experime

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487875#comment-15487875 ] Michael Armbrust commented on SPARK-15406: -- Hey Cody, thanks for the input

Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Michael Armbrust
try running explain on each of these. my guess would be caching is broken in some cases. On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski wrote: > Hi, > > Can anyone explain why spark.read.csv("people.csv").cache.show ends up > with a WARN while spark.read.text("people.csv").cache.show does not

Re:

2016-08-14 Thread Michael Armbrust
without the trick hoping it's gonna kick off broadcast. > Correct? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > >

Re:

2016-08-14 Thread Michael Armbrust
Have you tried doing the join in two parts (id == 0 and id != 0) and then doing a union of the results? It is possible that with this technique, that the join which only contains skewed data would be filtered enough to allow broadcasting of one side. On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma w

Re: call a mysql stored procedure from spark

2016-08-14 Thread Michael Armbrust
As described here , you can use the DataSource API to connect to an external database using JDBC. While the dbtable option is usually just a table name, it can also be any valid SQL command that returns a table
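
The approach described above can be sketched as follows. This is a minimal, hypothetical example — the URL, table, filter, and credentials are placeholders, not details from the original thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-example").getOrCreate()

// "dbtable" is usually just a table name, but any SQL that returns a
// table works when wrapped as a named subquery (the alias is required).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")   // placeholder URL
  .option("dbtable", "(SELECT id, name FROM people WHERE age > 21) t")
  .option("user", "reader")                          // placeholder creds
  .option("password", "secret")
  .load()
```

Note that a stored procedure itself generally cannot be invoked this way unless its logic can be rewritten as a plain query.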

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Michael Armbrust
Anytime you see JaninoRuntimeException you are seeing a bug in our code generation. If you can come up with a small example that causes the problem it would be very helpful if you could open a JIRA. On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar wrote: > I see a similar issue being resolved rece

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Michael Armbrust
There are two type systems in play here. Spark SQL's and Scala's. >From the Scala side, this is type-safe. After calling as[String]the Dataset will only return Strings. It is impossible to ever get a class cast exception unless you do your own incorrect casting after the fact. Underneath the co

Re: Does Spark SQL support indexes?

2016-08-14 Thread Michael Armbrust
Using df.write.partitionBy is similar to a coarse-grained, clustered index in a traditional database. You can't use it on temporary tables, but it will let you efficiently select small parts of a much larger table. On Sat, Aug 13, 2016 at 11:13 PM, Jörn Franke wrote: > Use a format that has bui
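
A sketch of the clustered-index analogy above (the column and path names are made up for illustration):

```scala
// partitionBy lays data out as one directory per distinct value:
//   /data/events/date=2016-08-13/..., /data/events/date=2016-08-14/...
df.write.partitionBy("date").parquet("/data/events")

// A filter on the partition column prunes to the matching directories,
// so only a small slice of the full table is actually scanned.
spark.read.parquet("/data/events").where($"date" === "2016-08-14")
```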

Re: Sorting within partitions is not maintained in parquet?

2016-08-11 Thread Michael Armbrust
This is an optimization to avoid overloading the scheduler with many small tasks. It bin-packs data into tasks based on the file size. You can disable it by setting spark.sql.files.openCostInBytes very high (higher than spark.sql.files.maxPartitionBytes). On Thu, Aug 11, 2016 at 4:27 AM, Hyukjin
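
A sketch of the workaround, with illustrative byte values (tune these for your own workload):

```scala
// Making the assumed per-file open cost larger than the max partition
// size effectively stops small files being bin-packed into one task.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
spark.conf.set("spark.sql.files.openCostInBytes", 512L * 1024 * 1024)
```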

Re: Source API requires unbounded distributed storage?

2016-08-04 Thread Michael Armbrust
Yeah, this API is in the private execution package because we are planning to continue to iterate on it. Today, we will only ever go back one batch, though that might change in the future if we do async checkpointing of internal state. You are totally right that we should relay this info back to

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Michael Armbrust
Nullable is an optimization for Spark SQL. It is telling spark to not even do an if check when accessing that field. In this case, your data *is* nullable, because timestamp is an object in java and you could put null there. On Thu, Aug 4, 2016 at 2:56 PM, luismattor wrote: > Hi all, > > Consi
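
The distinction can be seen in the inferred schema; `Event` here is a hypothetical case class, assuming a spark-shell session where `toDS` is available:

```scala
import java.sql.Timestamp

case class Event(id: Int, time: Timestamp)

val ds = Seq(Event(1, new Timestamp(0L))).toDS()
ds.printSchema()
// id is a primitive Int, so it can never hold null  -> nullable = false
// time is a JVM object that could be null           -> nullable = true
```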

Re: issue with coalesce in Spark 2.0.0

2016-08-03 Thread Michael Armbrust
Spark 2.0 is not binary compatible with Spark 1.x, you'll need to recompile your jar. On Tue, Aug 2, 2016 at 2:57 AM, 陈宇航 wrote: > Hi all. > > > I'm testing on Spark 2.0.0 and found an issue when using coalesce in > my code. > > The procedure is simple doing a coalesce for a RDD[String],

Re: error while running filter on dataframe

2016-07-31 Thread Michael Armbrust
You are hitting a bug in code generation. If you can come up with a small reproduction for the problem, it would be very helpful if you could open a JIRA. On Sun, Jul 31, 2016 at 9:14 AM, Tony Lane wrote: > Can someone help me understand this error which occurs while running a > filter on a da

Re: calling dataset.show on a custom object - displays toString() value as first column and blank for rest

2016-07-31 Thread Michael Armbrust
Can you share your code? This does not happen for me. On Sun, Jul 31, 2016 at 7:16 AM, Rohit Chaddha wrote: > I have a custom object c

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Michael Armbrust
You have to add a file in resources too (example). Either that or give a full class name. On Sun, Jul 31, 2016 at 9:45 AM, Ayoub Benali wrote: > Looks like t

Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Michael Armbrust
Are you sure you are running Spark 2.0? In your stack trace I see SqlNewHadoopRDD, which was removed in #12354 . On Sun, Jul 31, 2016 at 2:12 AM, Chanh Le wrote: > Hi everyone, > Why *MutableInt* cannot be cast to *MutableLong?* > It’s really weird an

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
> [error] import org.apache.spark.mllib.linalg.SingularValueDecomposition > [error] ^ > [error] > /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:5: > object mllib is not a member of package org.apache.spark > [error] import org.apach

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
Also, you'll want all of the various spark versions to be the same. On Tue, Jul 26, 2016 at 12:34 PM, Michael Armbrust wrote: > If you are using %% (double) then you do not need _2.11. > > On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers wrote: > >> >

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
If you are using %% (double) then you do not need _2.11. On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers wrote: > > my build file looks like > > libraryDependencies ++= Seq( > // other dependencies here > "org.apache.spark" %% "spark-core" % "1.6.2" % "provided", >
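
In build.sbt terms, and assuming `scalaVersion := "2.11.x"`, the two forms below are equivalent — the mistake is combining them:

```scala
// build.sbt — pick one form, not both:

// %% appends the project's Scala binary version automatically ...
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"

// ... which is equivalent to writing the suffix out with a single %:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.2" % "provided"

// Using %% together with an explicit _2.11 suffix would resolve to the
// nonexistent artifact spark-core_2.11_2.11 and fail.
```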

Re: Outer Explode needed

2016-07-25 Thread Michael Armbrust
I don't think this would be hard to implement. The physical explode operator supports it (for our HiveQL compatibility). Perhaps comment on this JIRA? https://issues.apache.org/jira/browse/SPARK-13721 It could probably just be another argument to explode() Michael On Mon, Jul 25, 2016 at 6:12

[jira] [Created] (SPARK-16724) Expose DefinedByConstructorParams

2016-07-25 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-16724: Summary: Expose DefinedByConstructorParams Key: SPARK-16724 URL: https://issues.apache.org/jira/browse/SPARK-16724 Project: Spark Issue Type: Bug

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Michael Armbrust
+1 On Fri, Jul 22, 2016 at 2:42 PM, Holden Karau wrote: > +1 (non-binding) > > Built locally on Ubuntu 14.04, basic pyspark sanity checking & tested with > a simple structured streaming project (spark-structured-streaming-ml) & > spark-testing-base & high-performance-spark-examples (minor change

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Michael Armbrust
+ dev, reynold Yeah, that's a good point. I wonder if SparkSession.sqlContext should be public/deprecated? On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers wrote: > in my codebase i would like to gradually transition to SparkSession, so > while i start using SparkSession i also want a SQLContext

[jira] [Created] (SPARK-16609) Single function for parsing timestamps/dates

2016-07-18 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-16609: Summary: Single function for parsing timestamps/dates Key: SPARK-16609 URL: https://issues.apache.org/jira/browse/SPARK-16609 Project: Spark Issue

[jira] [Updated] (SPARK-16609) Single function for parsing timestamps/dates

2016-07-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16609: - Target Version/s: 2.1.0 > Single function for parsing timestamps/dates

[jira] [Resolved] (SPARK-16531) Remove TimeZone from DataFrameTimeWindowingSuite

2016-07-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16531. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 14170

[jira] [Updated] (SPARK-16483) Unifying struct fields and columns

2016-07-11 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16483: - Target Version/s: 2.1.0 > Unifying struct fields and columns

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Michael Armbrust
This is protecting you from a limitation in parquet. The library will let you write out invalid files that can't be read back, so we added this check. You can call .format("csv") (in spark 2.0) to switch it to CSV. On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede wrote: > Hi everyone, > > I am tr

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
st > UC Berkeley AMPLab Alumni > > pedrorodriguez.io | 909-353-4423 > github.com/EntilZha | LinkedIn > <https://www.linkedin.com/in/pedrorodriguezscience> > > On July 9, 2016 at 2:19:11 PM, Michael Armbrust (mich...@databricks.com) > wrote: > > You can do whats called an *a

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
You can do whats called an *argmax/argmin*, where you take the min/max of a couple of columns that have been grouped together as a struct. We sort in column order, so you can put the timestamp first. Here is an example
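
A sketch of the argmin technique described above, with assumed column names (`key`, `timestamp`, `value`):

```scala
import org.apache.spark.sql.functions.{min, struct}

// Structs compare field-by-field in declaration order, so with the
// timestamp first, min(struct(...)) carries along the value from the
// row that had the earliest timestamp in each group.
val firstPerKey = df
  .groupBy($"key")
  .agg(min(struct($"timestamp", $"value")).as("argmin"))
  .select($"key", $"argmin.timestamp", $"argmin.value")
```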

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Michael Armbrust
We are planning to address this issue in the future. At a high level, we'll have to add a delta mode so that updates can be communicated from one operator to the next. On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly wrote: > Indeed. But nested aggregation does not work with Structured Streaming,

[jira] [Commented] (SPARK-8360) Structured Streaming (aka Streaming DataFrames)

2016-07-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359516#comment-15359516 ] Michael Armbrust commented on SPARK-8360: - This kind of question would be be

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Michael Armbrust
custom Sink and then doing your operations on that be a reasonable work > around? > > > On Tuesday, June 28, 2016, Michael Armbrust > wrote: > >> This is not too broadly worded, and in general I would caution that any >> interface in org.apache.spark.sql.catalyst or >

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Michael Armbrust
This is not too broadly worded, and in general I would caution that any interface in org.apache.spark.sql.catalyst or org.apache.spark.sql.execution is considered internal and likely to change in between releases. We do plan to open a stable source/sink API in a future release. The problem here i

Re: Logging trait in Spark 2.0

2016-06-28 Thread Michael Armbrust
I'd suggest using the slf4j APIs directly. They provide a nice stable API that works with a variety of logging backends. This is what Spark does internally. On Sun, Jun 26, 2016 at 4:02 AM, Paolo Patierno wrote: > Yes ... the same here ... I'd like to know the best way for adding logging > in

[jira] [Closed] (SPARK-16188) Spark sql create a lot of small files

2016-06-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-16188. Resolution: Not A Bug This is by design and changes would likely be too disruptive. The

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Michael Armbrust
+1 On Wed, Jun 22, 2016 at 11:33 AM, Jonathan Kelly wrote: > +1 > > On Wed, Jun 22, 2016 at 10:41 AM Tim Hunter > wrote: > >> +1 This release passes all tests on the graphframes and tensorframes >> packages. >> >> On Wed, Jun 22, 2016 at 7:19 AM, Cody Koeninger >> wrote: >> >>> If we're consid

Re: cast only some columns

2016-06-21 Thread Michael Armbrust
Use `withColumn`. It will replace a column if you give it the same name. On Tue, Jun 21, 2016 at 4:16 AM, pseudo oduesp wrote: > Hi , > with fillna we can select some columns to perform replace some values > with chosing columns with dict > {columns :values } > but how i can do same with c
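
A minimal sketch of the replacement behavior — the column name and target type here are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Reusing an existing column name replaces that column in place;
// all other columns pass through unchanged.
val fixed = df.withColumn("price", col("price").cast(DoubleType))
```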

Re: Question about equality of o.a.s.sql.Row

2016-06-20 Thread Michael Armbrust
> > This is because two objects are compared by "o1 != o2" instead of > "o1.equals(o2)" at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L408 Even equals(...) does not do what you want on the JVM: scala> Array(1,2).equals(Array(1,2)) res0: Boolean = false
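
The JVM array-equality pitfall in full, plus the element-wise alternatives:

```scala
// Arrays on the JVM compare by reference, even through equals():
Array(1, 2) == Array(1, 2)             // false: reference comparison
Array(1, 2).equals(Array(1, 2))        // false: arrays don't override equals
Array(1, 2).sameElements(Array(1, 2))  // true:  element-wise comparison
java.util.Arrays.equals(Array(1, 2), Array(1, 2)) // true
```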

[jira] [Resolved] (SPARK-16050) Flaky Test: Complete aggregation with Console sink

2016-06-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16050. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13776

Re: Hello

2016-06-17 Thread Michael Armbrust
Another good signal is the "target version" (which by convention is only set by committers). When I set this for the upcoming version it means I think its important enough that I will prioritize reviewing a patch for it. On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez wrote: > What is the best

Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Michael Armbrust
There is no public API for writing encoders at the moment, though we are hoping to open this up in Spark 2.1. What is not working about encoders for options? Which version of Spark are you running? This is working as I would expect? https://databricks-prod-cloudfront.cloud.databricks.com/public

Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-15 Thread Michael Armbrust
+1 to both of these! On Wed, Jun 15, 2016 at 12:21 PM, Sean Owen wrote: > 1.6.2 RC seems fine to me; I don't know of outstanding issues. Clearly > we need to keep the 1.x line going for a bit, so a bug fix release > sounds good, > > Although we've got some work to do before 2.0.0 it does look li

[jira] [Resolved] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15964. -- Resolution: Won't Fix > Assignment to RDD-typed val fails

[jira] [Commented] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332224#comment-15332224 ] Michael Armbrust commented on SPARK-15964: -- Thanks for reporting this, b

[jira] [Updated] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15915: - Assignee: Takuya Ueshin > CacheManager should use canonicalized plan for planToCache

[jira] [Resolved] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15915. -- Resolution: Fixed Fix Version/s: 2.0.0 > CacheManager should use canonicalized plan for planToCache

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Michael Armbrust
> > 1) What does this really mean to an Application developer? > It means there are fewer concepts to learn. > 2) Why this unification was needed in Spark 2.0? > To simplify the API and reduce the number of concepts that needed to be learned. We only didn't do it in 1.6 because we didn't want t

[jira] [Updated] (SPARK-15934) Return binary mode in ThriftServer

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15934: - Assignee: (was: Egor Pakhomov) > Return binary mode in ThriftServer

[jira] [Updated] (SPARK-15934) Return binary mode in ThriftServer

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15934: - Assignee: Egor Pakhomov Target Version/s: 2.0.0 Priority

Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Michael Armbrust
NoSuchMethodError always means that you are compiling against a different classpath than is available at runtime, so it sounds like you are on the right track. The project is not abandoned, we're just busy with the release. It would be great if you could open a pull request. On Tue, Jun 14, 2016

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Michael Armbrust
You might try with the Spark 2.0 preview. We spent a bunch of time improving the handling of many small files. On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda wrote: > I'm trying to use Spark SQL to load json data that are split across about > 70k > files across 24 directories in hdfs, using

Re: Spark Thrift Server in CDH 5.3

2016-06-13 Thread Michael Armbrust
I'd try asking on the cloudera forums. On Sun, Jun 12, 2016 at 9:51 PM, pooja mehta wrote: > Hi, > > How do I start Spark Thrift Server with cloudera CDH 5.3? > > Thanks. >

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Michael Armbrust
Here's a talk I gave on the topic: https://www.youtube.com/watch?v=i7l3JQRx7Qw http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel wrote: > In Spark 2.0, DataFrames and Datasets are

[jira] [Resolved] (SPARK-15489) Dataset kryo encoder won't load custom user settings

2016-06-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15489. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13424

[jira] [Resolved] (SPARK-6320) Adding new query plan strategy to SQLContext

2016-06-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6320. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13147 [https

[jira] [Resolved] (SPARK-15743) Prevent saving with all-column partitioning

2016-06-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15743. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13486

Re: Spark 2.0 Streaming and Event Time

2016-06-09 Thread Michael Armbrust
There is no special setting for event time (though we will be adding one for setting a watermark in 2.1 to allow us to reduce the amount of state that needs to be kept around). Just window/groupBy on the on the column that is your event time. On Wed, Jun 8, 2016 at 4:12 PM, Chang Lim wrote: > H

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Michael Armbrust
Look at the explain(). For a Seq we know its just local data so avoid spark jobs for simple operations. In contrast, an RDD is opaque to catalyst so we can't perform that optimization. On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski wrote: > Hi, > > I just noticed it today while toying with Sp
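
The difference shows up directly in `explain()` output; this assumes a spark-shell session (exact plan text varies by Spark version):

```scala
Seq(1, 2, 3).toDF("num").explain()
// physical plan is a LocalTableScan: the data is known to be local,
// so simple operations like show() need no Spark job

sc.parallelize(Seq(1, 2, 3)).toDF("num").explain()
// physical plan scans the existing RDD, which is opaque to Catalyst,
// so a job must be run even for simple operations
```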

[jira] [Updated] (SPARK-15743) Prevent saving with all-column partitioning

2016-06-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15743: - Labels: releasenotes (was: ) > Prevent saving with all-column partitioning

[jira] [Updated] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-06 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15786: - Target Version/s: 2.0.0 > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Michael Armbrust
e[], int, int)" > > The generated code is passing InternalRow objects into the ByteBuffer > > Starting from two Datasets of types Dataset[(Int, Int)] with expression > $"left._1" === $"right._1". I'll have to spend some time getting a better > understanding o

[jira] [Updated] (SPARK-15732) Dataset generated code "generated.java" Fails with Certain Case Classes

2016-06-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15732: - Priority: Critical (was: Major) > Dataset generated code "generated.java" Fails with Certain Case Classes

[jira] [Updated] (SPARK-15732) Dataset generated code "generated.java" Fails with Certain Case Classes

2016-06-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15732: - Target Version/s: 2.0.0 > Dataset generated code "generated.java" Fails with Certain Case Classes

[jira] [Commented] (SPARK-12931) Improve bucket read path to only create one single RDD

2016-06-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312702#comment-15312702 ] Michael Armbrust commented on SPARK-12931: -- It was fixed in: h

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Michael Armbrust
> > I'd think we want less effort, not more, to let people test it? for > example, right now I can't easily try my product build against > 2.0.0-preview. I don't feel super strongly one way or the other, so if we need to publish it permanently we can. However, either way you can still test again

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Michael Armbrust
Yeah, we don't usually publish RCs to central, right? On Wed, Jun 1, 2016 at 1:06 PM, Reynold Xin wrote: > They are here ain't they? > > https://repository.apache.org/content/repositories/orgapachespark-1182/ > > Did you mean publishing them to maven central? My understanding is that > publishin

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
ess Option doesn't have a first class Encoder or DataType > yet and maybe for good reasons. > > I did find the RDD join interface elegant, though. In the ideal world an > API comparable the following would be nice: > https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fc

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425 On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher wrote: > Hi, > > I've been working on transitioning from RDD to Datasets in our codebase in > anticipa

[jira] [Resolved] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming

2016-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15686. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13429

Re: Map tuple to case class in Dataset

2016-06-01 Thread Michael Armbrust
31, 2016 at 7:35 PM, Tim Gautier >>> wrote: >>> >>>> 1.6.1 The exception is a null pointer exception. I'll paste the whole >>>> thing after I fire my cluster up again tomorrow. >>>> >>>> I take it by the responses that this is sup

Re: Map tuple to case class in Dataset

2016-05-31 Thread Michael Armbrust
Version of Spark? What is the exception? On Tue, May 31, 2016 at 4:17 PM, Tim Gautier wrote: > How should I go about mapping from say a Dataset[(Int,Int)] to a > Dataset[]? > > I tried to use a map, but it throws exceptions: > > case class Test(a: Int) > Seq(1,2).toDS.map(t => Test(t)).show > >

[jira] [Commented] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306831#comment-15306831 ] Michael Armbrust commented on SPARK-15654: -- Thanks for point this out! L

[jira] [Updated] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15654: - Target Version/s: 2.0.0 > Reading gzipped files results in duplicate rows

[jira] [Updated] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15654: - Priority: Blocker (was: Critical) > Reading gzipped files results in duplicate rows

[jira] [Commented] (SPARK-15489) Dataset kryo encoder won't load custom user settings

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306826#comment-15306826 ] Michael Armbrust commented on SPARK-15489: -- As soon as you open a PR it

Re: Undocumented left join constraint?

2016-05-27 Thread Michael Armbrust
Sounds like: https://issues.apache.org/jira/browse/SPARK-15441, for which a fix is in progress. Please do keep reporting issues though, these are great! Michael On Fri, May 27, 2016 at 1:01 PM, Tim Gautier wrote: > Is it truly impossible to left join a Dataset[T] on the right if T has any > no

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Michael Armbrust
You can also just make sure that each user is using their own directory. A rough example can be found in TestHive. Note: in Spark 2.0 there should be no need to use HiveContext unless you need to talk to a metastore. On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh wrote: > Well make sure than

[jira] [Updated] (SPARK-15483) IncrementalExecution should use extra strategies.

2016-05-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15483: - Assignee: Takuya Ueshin > IncrementalExecution should use extra strategies

[jira] [Resolved] (SPARK-15483) IncrementalExecution should use extra strategies.

2016-05-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15483. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13261

Re: feedback on dataset api explode

2016-05-25 Thread Michael Armbrust
These APIs predate Datasets / encoders, so that is why they are Row instead of objects. We should probably rethink that. Honestly, I usually end up using the column expression version of explode now that it exists (i.e. explode($"arrayCol").as("Item")). It would be great to understand more why y
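
The column-expression form mentioned above looks like this; `id` and `arrayCol` are assumed column names:

```scala
import org.apache.spark.sql.functions.explode

// One output row per array element; the other selected columns are
// repeated for each element.
df.select($"id", explode($"arrayCol").as("item"))
```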

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-24 Thread Michael Armbrust
> > i can't give you permissions -- that has to be (most likely) through > someone @ databricks, like michael. > Another clarification: not databricks, but the Apache Spark PMC grants access to the JIRA / wiki. That said... I'm not actually sure how its done.

Re: Dataset Set Operations

2016-05-24 Thread Michael Armbrust
What is the schema of the case class? On Tue, May 24, 2016 at 3:46 PM, Tim Gautier wrote: > Hello All, > > I've been trying to subtract one dataset from another. Both datasets > contain case classes of the same type. When I subtract B from A, I end up > with a copy of A that still has the record

[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-24 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298883#comment-15298883 ] Michael Armbrust commented on SPARK-15489: -- It should run in the same JVM

[jira] [Updated] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-24 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15489: - Target Version/s: 2.0.0 > Dataset kryo encoder fails on Collections$UnmodifiableCollection

[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297095#comment-15297095 ] Michael Armbrust commented on SPARK-15489: -- Wild guess... https://github
