Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Thanks. Once you create the jira just reply to this email with the link. On Wednesday, March 2, 2016, Ewan Leith wrote: > Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if > we can, not sure if my own scala skills will be up to it but

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to not write shuffle files to disk, and typically it is not a problem because the network fetch throughput is lower than what disks can sustain. In most cases, especially with
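
A minimal sketch of the workaround discussed downthread (treating spark.local.dir as a fast buffer): pointing it at a RAM-backed filesystem. The mount path is hypothetical, and note that cluster managers may override spark.local.dir with their own environment settings.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-local-dir-sketch")
      .setMaster("local[*]")
      .set("spark.local.dir", "/mnt/ramdisk/spark") // hypothetical tmpfs mount

    val sc = new SparkContext(conf)
    // Shuffle files are still written, but they land on the RAM-backed dir:
    sc.parallelize(1 to 1000000).map(x => (x % 100, x)).groupByKey().count()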

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
> …spark.local.dir as a buffer pool of others. Hence, the performance of Spark is gated by the performance of spark.local.dir, even on large memory systems. "Currently it is not possible to not write shuffle files to disk." What c

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
> …ings but use spark.local.dir as a buffer pool of others. Hence, the performance of Spark is gated by the performance of spark.local.dir, even on large memory systems. "Currently it is not possible to not write shuffle files to disk."

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does, because it needs the range boundaries to be computed in order to build the RDD. It is a long-standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue, though. On Sun, Apr 24, 2016 at 10:54
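
A minimal sketch of the behavior described above (assuming an existing SparkContext `sc`): the sortByKey call itself launches a sampling job, before any action runs.

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    // sortByKey samples the keys to compute range-partition boundaries,
    // which launches a job even though no action has been called yet.
    val sorted = pairs.sortByKey()
    sorted.collect() // the "real" job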

Re: Spark 2.0 - SQL Subqueries.

2016-05-21 Thread Reynold Xin
https://issues.apache.org/jira/browse/SPARK-15078 was just a bunch of test harness changes and added no new functionality. To reduce confusion, I just backported it into branch-2.0, so SPARK-15078 is now in 2.0 too. Can you paste a query you were testing? On Sat, May 21, 2016 at 10:49 AM, Kamalesh Nair

[ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Reynold Xin
Hi all, Apache Spark 2.0.0 is the first release of the Spark 2.x line. It includes 2500+ patches from 300+ contributors. To download Spark 2.0, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: http://spark.apache.org/releases/spark-release-2-0-0.html

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Reynold Xin
The presentation at Spark Summit SF was probably referring to Structured Streaming. The existing Spark Streaming (dstream) in Spark 2.0 has the same production stability level as Spark 1.6. There is also Kafka 0.10 support in dstream. On July 25, 2016 at 10:26:49 AM, Andy Davidson (

Re: RDD vs Dataset performance

2016-07-28 Thread Reynold Xin
The performance difference comes from the need to serialize and deserialize data to AnnotationText. The extra stage is probably very quick and shouldn't have much impact. If you try to cache the RDD using serialized mode, it would slow down a lot too. On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath
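
A small sketch of the comparison suggested above: serialized caching imposes the same serialize/deserialize cost on an RDD, so the gap versus deserialized caching mirrors the RDD-versus-Dataset gap. The class definition and SparkContext `sc` are assumed; AnnotationText is the name from the thread.

    import org.apache.spark.storage.StorageLevel

    case class AnnotationText(id: Long, text: String)
    val rdd = sc.parallelize(1 to 100000).map(i => AnnotationText(i, s"text-$i"))

    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // store serialized, as suggested above
    rdd.count() // materializes the cache
    rdd.count() // every record must be deserialized again on each access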

Re: Spark Website

2016-07-13 Thread Reynold Xin
Thanks for reporting. This is due to https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-12055 On Wed, Jul 13, 2016 at 11:52 AM, Pradeep Gollakota wrote: > Worked for me if I go to https://spark.apache.org/site/ but not > https://spark.apache.org > > On

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Reynold Xin
Good idea. https://github.com/apache/spark/pull/14252 On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust wrote: > + dev, reynold > > Yeah, that's a good point. I wonder if SparkSession.sqlContext should be > public/deprecated? > > On Mon, Jul 18, 2016 at 8:37 AM,

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Reynold Xin
Yes. But in order to access methods available only in HiveContext, a cast by the user is required. On Tuesday, July 19, 2016, Maciej Bryński <mac...@brynski.pl> wrote: > @Reynold Xin, > How will this work with Hive support? > Does SparkSession.sqlContext return HiveContext? > > 2016

Re: ml and mllib persistence

2016-07-12 Thread Reynold Xin
Also, Java serialization isn't great for cross-platform compatibility. On Tuesday, July 12, 2016, aka.fe2s wrote: > Okay, I think I found an answer to my question. Some models (for instance > org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs, > so just
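
A hedged sketch of the alternative being contrasted here: the spark.ml writers persist models in a language-neutral on-disk format (JSON metadata plus Parquet data) rather than Java serialization. The path and the `training` DataFrame are assumed.

    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

    // `training` is an existing DataFrame with "label" and "features" columns
    val model = new LogisticRegression().fit(training)

    model.write.overwrite().save("/tmp/lr-model") // JSON metadata + Parquet data
    val restored = LogisticRegressionModel.load("/tmp/lr-model")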

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
> The workaround I can imagine is just to cache and materialize `df` by `df.cache.count()`, and then call `df.filter(...).show()`. It should work, just a little bit tedious. > > On Mon, Aug 8, 2016 at 10:00 PM, Reynold Xin <r...@databricks.com> wrote:

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
That is unfortunately the way the Scala compiler captures (and defines) closures. Nothing is really final in the JVM. You can always use reflection or Unsafe to modify the value of fields. On Mon, Aug 8, 2016 at 8:16 PM, Simon Scott wrote: > But does the "notSer"
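
A minimal sketch of the capture behavior being described: referencing a field inside a closure drags the whole enclosing instance into the serialized task, while copying to a local val first narrows what the compiler captures. Class and field names are illustrative.

    class MyJob(sc: org.apache.spark.SparkContext) { // not Serializable
      val factor = 10

      def run(): Array[Int] = {
        val localFactor = factor // copy to a local val: only the Int is captured
        sc.parallelize(1 to 5).map(_ * localFactor).collect()
        // Using `factor` directly would capture `this` (all of MyJob) and fail
        // with a "Task not serializable" error.
      }
    }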

Re: Logical Plan

2016-06-30 Thread Reynold Xin
Which version are you using here? If the underlying files change, technically we should go through optimization again. Perhaps the real "fix" is to figure out why logical plan creation is so slow for 700 columns. On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh wrote: >

Re: is dataframe thread safe?

2017-02-13 Thread Reynold Xin
Yes, your use case should be fine. Multiple threads can transform the same data frame in parallel, since they create different data frames. On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf wrote: > Hi, > > I was wondering if dataframe is considered thread safe. I know
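
A small sketch of the pattern deemed safe above: each thread only derives new DataFrames from the shared one, never mutating it. The DataFrame `df` with columns `key` and `value` is assumed.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val futures = Seq(
      Future { df.filter("value > 10").count() },           // thread 1: its own derived DF
      Future { df.groupBy("key").count().collect().length } // thread 2: its own derived DF
    )
    futures.foreach(f => Await.result(f, 10.minutes))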

[ANNOUNCE] Announcing Apache Spark 1.6.3

2016-11-07 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.3! This maintenance release includes fixes across several areas of Spark, and we encourage users on the 1.6.x line to upgrade to 1.6.3. Head to the project's download page to download the new version: http://spark.apache.org/downloads.html

[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2! Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along with Kafka 0.10 support and runtime metrics for Structured Streaming. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all

Re: Re: Re: Multiple streaming aggregations in structured streaming

2016-11-22 Thread Reynold Xin
It's just the "approx_count_distinct" aggregate function. On Tue, Nov 22, 2016 at 6:51 PM, Xinyu Zhang <wsz...@163.com> wrote: > Could you please tell me how to use the approximate count distinct? Is > there any docs? > > Thanks > > > At 2016-11-21 15:56:2
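
A short sketch of the function in use; the DataFrame and column names are assumed, and in Spark 2.1+ the function is spelled approx_count_distinct, with an optional maximum relative error.

    import org.apache.spark.sql.functions.approx_count_distinct

    // `events` is an existing DataFrame with a `userId` column
    events.agg(approx_count_distinct("userId").alias("uv")).show()
    // Optionally trade accuracy for speed/memory with a max relative error:
    events.agg(approx_count_distinct("userId", 0.01).alias("uv")).show()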

Re: Third party library

2016-11-25 Thread Reynold Xin
bcc dev@ and add user@ This is more a user@ list question rather than a dev@ list question. You can do something like this:

    object MySimpleApp {
      def loadResources(): Unit = // define some idempotent way to load resources, e.g. with a flag or lazy val
      def main() = { ...
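
A hedged completion of the sketch above, using the lazy-val flavor; all names are illustrative.

    object MySimpleApp {
      // A lazy val is initialized at most once per JVM, so each executor
      // loads its resources a single time, on first use.
      lazy val resources: Unit = loadResources()

      def loadResources(): Unit = {
        // idempotent per JVM, e.g. System.loadLibrary("somenativelib")
      }

      def process(record: String): String = {
        resources          // forces the one-time initialization on the executor
        record.toUpperCase // stand-in for the real work against the resources
      }
    }

    // usage: rdd.map(MySimpleApp.process)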

Re: Bit-wise AND operation between integers

2016-11-28 Thread Reynold Xin
Bcc dev@ and add user@ The dev list is not meant for users to ask questions on how to use Spark. For that you should use StackOverflow or the user@ list.

    scala> sql("select 1 & 2").show()
    +-------+
    |(1 & 2)|
    +-------+
    |      0|
    +-------+

    scala> sql("select 1 & 3").show()
    +-------+
    |(1 & 3)|

Re: Third party library

2016-11-26 Thread Reynold Xin
…o-C-library. > Am I missing something? If possible, can you point me to an existing > implementation which I can refer to. > > Thanks again. > > ~ > > On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin <r...@databricks.com> wrote: > >> bcc dev@ and add user@

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread Reynold Xin
Adding a new data type is an enormous undertaking and very invasive. I don't think it is worth it in this case given there are clear, simple workarounds. On Thu, Nov 17, 2016 at 12:24 PM, kant kodali wrote: > Can we have a JSONType for Spark SQL? > > On Wed, Nov 16, 2016 at
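
A sketch of the kind of workaround being referred to; the column name and schema are assumptions. get_json_object has been available for some time, while from_json arrived around Spark 2.1.

    import org.apache.spark.sql.functions.{col, from_json, get_json_object}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // `df` is an existing DataFrame with a string column `json_blob`
    // Option 1: extract individual fields by JSONPath
    df.select(get_json_object(col("json_blob"), "$.name").alias("name"))

    // Option 2: parse the whole blob into a struct with an explicit schema
    val schema = StructType(Seq(StructField("name", StringType)))
    df.select(from_json(col("json_blob"), schema).alias("parsed"))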

Re: Re: Multiple streaming aggregations in structured streaming

2016-11-20 Thread Reynold Xin
Can you use the approximate count distinct? On Sun, Nov 20, 2016 at 11:51 PM, Xinyu Zhang wrote: > > MapWithState is also very useful. > I want to calculate UV in real time, but "distinct count" and "multiple > streaming aggregations" are not supported. > Is there any method to

Mark DataFrame/Dataset APIs stable

2016-10-12 Thread Reynold Xin
I took a look at all the public APIs we expose in o.a.spark.sql tonight, and realized we still have a large number of APIs that are marked experimental. Most of these haven't really changed, except in 2.0 we merged DataFrame and Dataset. I think it's long overdue to mark them stable. I'm tracking

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Reynold Xin
You can just write some files out directly (and idempotently) in your map/mapPartitions functions. It is just a function in which you can run arbitrary code, after all. On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit wrote: > Any suggestions on this one? > > Regards > Sumit
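
A minimal sketch of the idempotent pattern suggested above (output directory is illustrative): keying the file name on the partition index means a retried task overwrites its own output instead of duplicating it.

    import java.nio.file.{Files, Paths}

    // `rdd` is an existing RDD[String]
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      val rows = iter.toArray
      val dir = Paths.get("/tmp/side-output") // illustrative path
      Files.createDirectories(dir)
      // Deterministic, partition-keyed file name => safe to re-run (idempotent)
      Files.write(dir.resolve(s"part-$idx.txt"),
        rows.mkString("\n").getBytes("UTF-8"))
      rows.iterator
    }.count() // an action is still required to actually run the job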

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Reynold Xin
This should fix it: https://github.com/apache/spark/pull/16080 On Wed, Nov 30, 2016 at 10:55 AM, Timur Shenkao wrote: > Hello, > > Yes, I used hiveContext, sqlContext, sparkSession from Java, Scala, > Python. > Via spark-shell, spark-submit, IDE (PyCharm, Intellij IDEA). >

Re: SQL specific documentation for recent Spark releases

2017-08-11 Thread Reynold Xin
This PR should help you in the next release: https://github.com/apache/spark/pull/18702 On Thu, Aug 10, 2017 at 7:46 PM, Stephen Boesch wrote: > > The correct link is https://docs.databricks.com/spark/latest/spark-sql/index.html . > > This link does have the core syntax

Re: Question on Spark code

2017-07-23 Thread Reynold Xin
…, it will return the same type at the level that you called it. >> On Sun, Jul 23, 2017 at 8:20 PM Reynold Xin <r...@databricks.com> wrote: >>> It means the same object ("this") is returned. >>> On Sun,

Re: Question on Spark code

2017-07-23 Thread Reynold Xin
It means the same object ("this") is returned. On Sun, Jul 23, 2017 at 8:16 PM, tao zhan wrote: > Hello, > > I am new to Scala and Spark. > What is the "this.type" in the set function for? > > https://github.com/apache/spark/blob/481f0792944d9a77f0fe8b5e2596da

Re: the dependence length of an RDD, can its size be greater than 1 please?

2017-06-15 Thread Reynold Xin
A join? On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : Seq[Dependency[_]] > > It is a Seq, which means it can keep more than one dependency. > > I have a question about this. > Is it possible that its size is
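
A small sketch confirming the multi-parent case; union is used here because its UnionRDD exposes one dependency per parent directly, and a join similarly cogroups two parent RDDs under the hood. An existing SparkContext `sc` is assumed.

    val a = sc.parallelize(Seq((1, "a"), (2, "b")))
    val b = sc.parallelize(Seq((1, "x"), (2, "y")))

    val u = a.union(b)           // UnionRDD with two parents
    println(u.dependencies.size) // 2: one RangeDependency per parent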

Re: Dataset API Question

2017-10-25 Thread Reynold Xin
It is a bit more than syntactic sugar, but not much more: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533 BTW this is basically writing all the data out and then creating a new Dataset to load it back in. On Wed, Oct 25, 2017 at 6:51 AM,

Re: Anyone knows how to build and run Spark on JDK 9?

2017-10-26 Thread Reynold Xin
It probably depends on the Scala version we use in Spark supporting Java 9 first. On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun wrote: > Hi all: > > 1. I want to build Spark on JDK 9 and test it with Hadoop on a JDK 9 > environment. I searched for JIRAs related to JDK 9. I only

Re: Fw:multiple group by action

2018-08-24 Thread Reynold Xin
Use rollup and cube. On Fri, Aug 24, 2018 at 7:55 PM 崔苗 wrote: > Forwarded message > From: "崔苗" > Date: 2018-08-25 10:54:31 > To: d...@spark.apache.org > Subject: multiple group by action > > Hi, > we have some user data with >
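
A short sketch of the two operators (DataFrame and column names are assumptions): rollup computes the hierarchical subtotals, while cube computes all grouping combinations, each in a single pass.

    import org.apache.spark.sql.functions.sum

    // `df` is an existing DataFrame with columns "city", "device", "revenue"
    // Groupings: (city, device), (city), and the grand total ()
    df.rollup("city", "device").agg(sum("revenue")).show()

    // Groupings: (city, device), (city), (device), and ()
    df.cube("city", "device").agg(sum("revenue")).show()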

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Reynold Xin
Do you have a cached copy? I see it here http://spark.apache.org/downloads.html On Thu, Nov 8, 2018 at 4:12 PM Li Gao wrote: > this is wonderful ! > I noticed the official spark download site does not have 2.4 download > links yet. > > On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde wrote: > >>

Re: Back to SQL

2018-10-03 Thread Reynold Xin
No, we used to have that (for views), but it wasn't working well enough, so we removed it. On Wed, Oct 3, 2018 at 6:41 PM Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > Is there any known way to go from a Spark SQL Logical Plan (optimised ?) > Back to a SQL query ? > >

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Reynold Xin
i'd like to second that. if we want to communicate timeline, we can add to the release notes saying py2 will be deprecated in 3.0, and removed in a 3.x release. -- excuse the brevity and lower case due to wrist injury On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia wrote: > That’s a good point

Re: [HELP WANTED] Apache Zipkin (incubating) needs Spark gurus

2019-03-21 Thread Reynold Xin
Are there specific questions you have? Might be easier to post them here also. On Wed, Mar 20, 2019 at 5:16 PM Andriy Redko wrote: > Hello Dear Spark Community! > > The hyper-popularity of the Apache Spark made it a de-facto choice for many > projects which need some sort of data processing

Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Reynold Xin
If we can make the annotations compatible with Python 2, why don’t we add type annotations to make life easier for users of Python 3 (with type)? On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz wrote: > > Hello everyone, > > I'd like to revisit the topic of adding PySpark type annotations in

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan. On Thu, May 30, 2019 at 7:55 AM shane knapp wrote: > I don't have a good sense of the overhead of continuing to support >> Python 2; is it large enough to consider dropping it in Spark 3.0? >> >> from the build/test side, it will actually be pretty easy to continue >

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first? On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote: > Hi, All. > > Since we use both Apache JIRA and GitHub actively for Apache Spark > contributions, we have lots of JIRAs and PRs consequently. One specific > thing I've been

Re: Koalas show data in IDE or pyspark

2019-05-14 Thread Reynold Xin
This has been fixed and was included in release 0.3 last week. We will be making another release (0.4) in the next 24 hours to include more features as well. On Tue, Apr 30, 2019 at 12:42 AM, Manu Zhang <owenzhang1...@gmail.com> wrote: > > Hi, > > > It seems koalas.DataFrame can't be

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit, but a JVM string cannot be bigger than 2GB. It will also at some point run out of memory with too big a query plan tree, or become incredibly slow due to query planning complexity. I've seen queries that are tens of MBs in size. On Thu, Jul 11, 2019 at 5:01 AM, 李书明

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
…'s. Any samples to share :) > > Regards, > Gourav > > On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin <r...@databricks.com> wrote: > >> There is no explicit limit but a JVM string cannot be bigger than 2G. It >> will also

Re: Problems running TPC-H on Raspberry Pi Cluster

2019-07-11 Thread Reynold Xin
I don't think Spark is meant to run with 1GB of memory for the entire system. The JVM loads almost 200MB of bytecode, and each page during query processing takes a minimum of 64MB. Maybe on the 4GB model of the Raspberry Pi 4. On Wed, Jul 10, 2019 at 7:57 AM, agg212 <alexander_galaka...@brown.edu>

Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
A while ago we changed it so the task gets broadcasted too, so I think the two are fairly similar. On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote: > > I was wondering if anyone could help with this question. > > On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti
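
A minimal sketch of the two approaches being compared (assuming an existing SparkContext `sc`): an explicit broadcast variable versus letting the collection ride along in the task closure, which is itself broadcast since the change mentioned above.

    val lookup = Map("a" -> 1, "b" -> 2) // driver-side collection

    // Explicit: shipped once per executor and reused across tasks
    val bc = sc.broadcast(lookup)
    sc.parallelize(Seq("a", "b", "a")).map(k => bc.value.getOrElse(k, 0)).sum()

    // Implicit: `lookup` is serialized into the task closure, and the task
    // itself is broadcast, hence "fairly similar" in practice
    sc.parallelize(Seq("a", "b", "a")).map(k => lookup.getOrElse(k, 0)).sum()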

Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
> We are still at 2.2. > > On Tue, 24 Sep, 2019, 9:17 AM Reynold Xin, <r...@databricks.com> wrote: > >> A while ago we changed it so the task gets broadcasted too, so I think the >> two are fairly similar.

Re: FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Reynold Xin
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of both new and old users? For old users, their old code that was working for char(3) would now stop working. For new users, depending on whether the underlying metastore char(3) is either supported but different from ansi
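
For context, a hedged illustration of the CHAR semantics under discussion; exact behavior differs across Spark versions and catalogs, and the table here is hypothetical.

    // Under ANSI CHAR(n) semantics the stored value reads back space-padded:
    spark.sql("CREATE TABLE t (c CHAR(3)) USING parquet")
    spark.sql("INSERT INTO t VALUES ('a')")
    spark.sql("SELECT c, length(c) FROM t").show()
    // 'a  ' with length 3 where padding applies, and an unpadded 'a' where it
    // does not, which is the inconsistency the thread is about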

Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Reynold Xin
…the proposed alternative to reduce the potential issue. > > Please give us your opinion since it's still a PR. > > Bests, > Dongjoon. > > On Sat, Mar 14, 2020 at 17:54 Reynold Xin <r...@databricks.com> wrote: > >>

Re: results of taken(3) not appearing in console window

2020-03-26 Thread Reynold Xin
bcc dev, +user You need to print out the result. Take itself doesn't print. You only got the results printed to the console because the Scala REPL automatically prints the returned value from take. On Thu, Mar 26, 2020 at 12:15 PM, Zahid Rahman <zahidr1...@gmail.com> wrote: > > I am
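
A tiny sketch of the difference (assuming an existing RDD `rdd`): in a compiled application nothing appears unless you print explicitly, whereas the REPL echoes the returned array for you.

    val first3 = rdd.take(3) // returns an Array to the driver; prints nothing
    first3.foreach(println)  // explicit printing works everywhere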

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
…joon.h...@gmail.com> wrote: >>> Hi, Reynold. >>> (And +Michael Armbrust) >>> If you think so, do you think it's okay that we change the return value >>> silently? Then, I'm wondering why we reverted `TRIM`

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
…so deviate away from the standard on this specific behavior. On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <r...@databricks.com> wrote: > > I looked up our usage logs (sorry I can't share this publicly) and trim > has at least four orders of magnitude higher usage than char. >

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
>> 100% agree with Reynold. >> >> Regards, >> Gourav Sengupta >> >> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <r...@databricks.com> wrote: >> >>> Are

[ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Reynold Xin
Hi all, Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of the innovations from Spark 2.x, bringing new ideas as well as continuing long-term projects that have been in development. This release resolves more than 3400 tickets. We'd like to thank our contributors

Re: Stickers and Swag

2022-06-14 Thread Reynold Xin
Nice! Going to order a few items myself ... On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang <ltn...@gmail.com> wrote: > > FYI now you can find the shopping information on https://spark.apache.org/community as well :) > > Gengliang

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that the ASF couldn't have officially blessed venues beyond the already approved ones. So that's something to look into. Now of course you are welcome to run unofficial things unblessed as long as they follow trademark
