some joins stopped working with spark 2.0.0 SNAPSHOT

2016-02-26 Thread Koert Kuipers
dataframe df1: schema: StructType(StructField(x,IntegerType,true)) explain: == Physical Plan == MapPartitions , obj#135: object, [if (input[0, object].isNullAt) null else input[0, object].get AS x#128] +- MapPartitions , createexternalrow(if (isnull(x#9)) null else x#9), [input[0, object] AS
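A minimal sketch of the kind of DataFrame-to-Dataset-and-back round trip that produces the MapPartitions / createexternalrow steps in the plan quoted above; the actual failing join is not shown in the snippet, so the case class and column name here are placeholders.

```scala
import org.apache.spark.sql.SQLContext

// Placeholder record type mirroring the single integer column x in the quoted schema.
case class XRow(x: Int)

def reproduce(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._

  val df1 = Seq(XRow(1), XRow(2), XRow(3)).toDF()

  // Going through a typed Dataset and back inserts the object (de)serialization
  // nodes (MapPartitions, createexternalrow) that appear in the physical plan above.
  val df2 = df1.as[XRow].map(identity).toDF()

  // A join between the original and the round-tripped DataFrame on column "x".
  df1.join(df2, "x").explain()
}
```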

Re: Spark 1.6.1

2016-02-26 Thread Josh Rosen
I updated the release packaging scripts to use SFTP via the *lftp* client: https://github.com/apache/spark/pull/11350 I'm starting the process of cutting a 1.6.1-RC1 tag and release artifacts right now, so please be extra careful about merging into branch-1.6 until after the release. Once the RC

Re: Upgrading to Kafka 0.9.x

2016-02-26 Thread Joel Koshy
The 0.9 release still has the old consumer as Jay mentioned but this specific release is a little unusual in that it also provides a completely new consumer client. Based on what I understand, users of Kafka need to upgrade their brokers to Kafka 0.9.x first, before they upgrade their clients

Re: Upgrading to Kafka 0.9.x

2016-02-26 Thread Mark Grover
Thanks Jay. Yeah, if we were able to use the old consumer API from 0.9 clients to work with 0.8 brokers that would have been super helpful here. I am just trying to avoid a scenario where Spark cares about new features from every new major release of Kafka (which is a good thing) but ends up

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Jakob Odersky
I would recommend (non-binding) option 1. Apart from the API breakage I can see only advantages, and that sole disadvantage is minimal for a few reasons: 1. the DataFrame API has been "Experimental" since its implementation, so no stability was ever implied 2. considering that the change is for

Re: More Robust DataSource Parameters

2016-02-26 Thread Reynold Xin
Thanks for the email. This sounds great in theory, but might run into two major problems: 1. Need to support 4+ programming languages (SQL, Python, Java, Scala) 2. API stability (both backward and forward) On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari wrote: > Hi

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Reynold Xin
That's actually not Row vs non-Row. It's just primitive vs non-primitive. Primitives get automatically flattened, to avoid having to type ._1 all the time. On Fri, Feb 26, 2016 at 2:06 AM, Sun, Rui wrote: > Thanks for the explanation. > > > > What is confusing me is the
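To make the flattening concrete, here is a small sketch against the 1.6-era SQLContext.createDataset API used elsewhere in this thread; the values are arbitrary.

```scala
import org.apache.spark.sql.SQLContext

def flatteningExample(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._

  // A Dataset of a primitive type: the single value is flattened into one
  // column (named "value"), so there is no ._1 to type.
  val primitives = sqlContext.createDataset(Seq(1, 2, 3))
  println(primitives.schema)  // StructType(StructField(value,IntegerType,false))

  // A Dataset of a non-primitive (tuple) type keeps its fields as _1 and _2.
  val tuples = sqlContext.createDataset(Seq((1, "a"), (2, "b")))
  println(tuples.schema)      // StructType(StructField(_1,...), StructField(_2,...))
}
```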

External dependencies in public APIs (was previously: Upgrading to Kafka 0.9.x)

2016-02-26 Thread Reynold Xin
Dropping Kafka list since this is about a slightly different topic. Every time we have exposed the API of a 3rd party application as a public Spark API, it has caused some problems down the road. This goes from Hadoop, Tachyon, Kafka, to Guava. Most of these are used for input/output. The good thing is

Re: make-distribution.sh fails because tachyon-project was renamed to Alluxio

2016-02-26 Thread Jiří Šimša
Hi Jong, the download links should be fixed now. Best, On Fri, Feb 26, 2016 at 9:19 AM, Jiří Šimša wrote: > Hi Jong, > > Thank you for pointing that out. I am one of the maintainers of the > Alluxio project, formerly known as Tachyon, and will make sure that the old >

Upgrading to Kafka 0.9.x

2016-02-26 Thread Mark Grover
Hi Kafka devs, I come to you with a dilemma and a request. Based on what I understand, users of Kafka need to upgrade their brokers to Kafka 0.9.x first, before they upgrade their clients to Kafka 0.9.x. However, that presents a problem to other projects that integrate with Kafka (Spark, Flume,

Re: Hbase in spark

2016-02-26 Thread Ted Malaska
Yes, and I have used HBASE-15271 and successfully loaded over 20 billion records into HBase even with node failures. On Fri, Feb 26, 2016 at 11:55 AM, Ted Yu wrote: > In HBase, there is an hbase-spark module which supports bulk load. > This module is to be backported in the

Re: make-distribution.sh fails because tachyon-project was renamed to Alluxio

2016-02-26 Thread Jiří Šimša
Hi Jong, Thank you for pointing that out. I am one of the maintainers of the Alluxio project, formerly known as Tachyon, and will make sure that the old download links still work. I will update this thread when it is fixed. On a related note, any Spark version will work with Alluxio 1.0 (or any

Re: make-distribution.sh fails because tachyon-project was renamed to Alluxio

2016-02-26 Thread Sean Owen
Yes, though more broadly, should this just be removed for 2.x? I had this sense Tachyon was going away, or at least being put into a corner of the project. There's probably at least no need for special builds for it. On Fri, Feb 26, 2016 at 3:47 PM, Jong Wook Kim wrote: > Hi,

Re: Hbase in spark

2016-02-26 Thread Ted Yu
In HBase, there is an hbase-spark module which supports bulk load. This module is to be backported in the upcoming 1.3.0 release. There is some pending work, such as HBASE-15271. FYI On Fri, Feb 26, 2016 at 8:50 AM, Renu Yadav wrote: > Has anybody implemented bulk load into
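For reference, a rough sketch of what bulk load from Spark looks like without the hbase-spark module, by writing HFiles through HFileOutputFormat2; the column family, qualifier, and paths below are placeholders, and rows must be sorted by key.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

def writeHFiles(sc: SparkContext): Unit = {
  val conf = HBaseConfiguration.create()
  val records = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

  val keyValues = records
    .sortByKey()  // HFileOutputFormat2 requires rows in key order
    .map { case (rowKey, value) =>
      val kv = new KeyValue(
        Bytes.toBytes(rowKey),   // row key
        Bytes.toBytes("cf"),     // column family (placeholder)
        Bytes.toBytes("col"),    // qualifier (placeholder)
        Bytes.toBytes(value))    // cell value
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
    }

  // The resulting HFiles can then be handed to LoadIncrementalHFiles.
  keyValues.saveAsNewAPIHadoopFile(
    "/tmp/hfiles",               // staging directory (placeholder)
    classOf[ImmutableBytesWritable],
    classOf[KeyValue],
    classOf[HFileOutputFormat2],
    conf)
}
```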

More Robust DataSource Parameters

2016-02-26 Thread Hamel Kothari
Hi devs, Has there been any discussion around changing the DataSource parameters argument to be something more sophisticated than Map[String, String]? As you write more complex DataSources there are likely to be a variety of parameters of varying formats which are needed, and having to coerce them
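For context, this is roughly what the string-only contract looks like from the caller's side today, sketched against the 1.6-era DataFrameReader; the format name and option keys are hypothetical.

```scala
import org.apache.spark.sql.SQLContext

def readWithStringOptions(sqlContext: SQLContext) = {
  sqlContext.read
    .format("com.example.mysource")           // hypothetical data source
    .option("hosts", "node1:9042,node2:9042") // a list, flattened into a string
    .option("fetchSize", "5000")              // a number, passed as a string
    .option("useSsl", "true")                 // a boolean, passed as a string
    .load()
}
```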

make-distribution.sh fails because tachyon-project was renamed to Alluxio

2016-02-26 Thread Jong Wook Kim
Hi, Spark's packaging script downloads Tachyon from tachyon-project.org, which is now redirected to alluxio.org. I guess the URL should be changed to http://alluxio.org/downloads/files/0.8.2/, is it right? Jong Wook

Re: Aggregation + Adding static column + Union + Projection = Problem

2016-02-26 Thread Herman van Hövell tot Westerflier
Hi Jiří, Thanks for your mail. Could you create a JIRA ticket for this: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

Fwd: Aggregation + Adding static column + Union + Projection = Problem

2016-02-26 Thread Jiří Syrový
Hi, I've recently noticed a bug in Spark (branch 1.6) that appears if you do the following. Let's have some DataFrame called df. 1) Aggregate multiple columns on the DataFrame df and store the result as result_agg_1 2) Do another aggregation of multiple columns, but on one fewer grouping column
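A sketch of the sequence described, with placeholder column names and a literal standing in for the static column, since the report is truncated here.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit, sum}

def reproduce(df: DataFrame): DataFrame = {
  // 1) Aggregate on multiple grouping columns.
  val resultAgg1 = df.groupBy("a", "b").agg(sum("x").as("sum_x"))

  // 2) Aggregate again on one fewer grouping column, and add a static column
  //    so the two results line up for the union.
  val resultAgg2 = df.groupBy("a").agg(sum("x").as("sum_x"))
    .withColumn("b", lit("ALL"))
    .select("a", "b", "sum_x")

  // 3) Union the two aggregates and project.
  resultAgg1.unionAll(resultAgg2).select("a", "b", "sum_x")
}
```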

Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
My job gets this exception very easily even when I set a large value of spark.driver.maxResultSize. After checking the Spark code, I found spark.driver.maxResultSize is also used on the executor side to decide whether a DirectTaskResult or an IndirectTaskResult is sent. This doesn't make sense to me. Using
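For reference, spark.driver.maxResultSize is the documented cap on the total serialized size of results collected to the driver; a minimal sketch of setting it (the size and the collect below are only illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}

def example(): Unit = {
  val conf = new SparkConf()
    .setAppName("max-result-size-example")
    // Cap on the total serialized size of results collected to the driver;
    // per the mail above, the same value also influences how executors choose
    // between sending a DirectTaskResult and an IndirectTaskResult.
    .set("spark.driver.maxResultSize", "2g")

  val sc = new SparkContext(conf)
  // A collect whose combined task results exceed the cap fails with the
  // "Total size of serialized results ... is bigger than spark.driver.maxResultSize" error.
  val collected = sc.parallelize(1 to 1000000).collect()
  println(collected.length)
  sc.stop()
}
```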

Re: DirectFileOutputCommiter

2016-02-26 Thread Teng Qiu
Hi, thanks :) The performance gain is huge: we have an INSERT INTO query where ca. 30GB in JSON format is written to S3 at the end; without DirectOutputCommitter and our hack in Hive and InsertIntoHiveTable.scala it took more than 40 min, with our changes only 15 min. DirectOutputCommitter works
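For readers following along, a sketch of how a direct committer is usually wired in for the old mapred API; the committer class name below is a placeholder for whatever DirectOutputCommitter implementation is used, and is not part of Spark itself.

```scala
import org.apache.spark.SparkContext

def useDirectCommitter(sc: SparkContext): Unit = {
  // For the old mapred API, Hadoop picks the committer from this property.
  sc.hadoopConfiguration.set(
    "mapred.output.committer.class",
    "com.example.DirectOutputCommitter")  // hypothetical committer class

  // Speculation should stay off: a committer that writes straight to the final
  // S3 location cannot safely handle two attempts of the same task.
  // (spark.speculation defaults to false.)
}
```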

RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Sun, Rui
Thanks for the explanation. What is confusing me is the different internal semantics of Dataset on non-Row types (primitive types, for example) versus the Row type: Dataset[Int] is internally actually Dataset[Row(value:Int)] scala> val ds = sqlContext.createDataset(Seq(1,2,3)) ds:

Re: how about a custom coalesce() policy?

2016-02-26 Thread Reynold Xin
Using the right email for Nezih On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin wrote: > I think this can be useful. > > The only thing is that we are slowly migrating to the Dataset/DataFrame > API, and leave RDD mostly as is as a lower level API. Maybe we should do > both?

Re: how about a custom coalesce() policy?

2016-02-26 Thread Reynold Xin
I think this can be useful. The only thing is that we are slowly migrating to the Dataset/DataFrame API, and leave RDD mostly as is as a lower level API. Maybe we should do both? In either case it would be great to discuss the API on a pull request. Cheers. On Wed, Feb 24, 2016 at 2:08 PM, Nezih
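For comparison, the existing coalesce API only takes a target partition count and a shuffle flag; the trait below is purely an illustration of what a pluggable policy might look like, not an existing Spark API.

```scala
import org.apache.spark.rdd.RDD

// Today's API: only a partition count and a shuffle flag can be specified.
def currentApi(rdd: RDD[String]): RDD[String] =
  rdd.coalesce(numPartitions = 10, shuffle = false)

// Hypothetical shape of a pluggable policy (illustration only):
trait PartitionGrouper {
  /** Assign each parent partition index to a group (the new partition index). */
  def group(parentPartitionIndex: Int, numParentPartitions: Int): Int
}
```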