Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Shixiong(Ryan) Zhu
+1 (binding) Best Regards, Ryan On Tue, Jun 9, 2020 at 4:24 AM Wenchen Fan wrote: > +1 (binding) > > On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao wrote: > >> +1 (non-binding) >> >> >> >> -- >> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >> >>

Re: [DISCUSS] "complete" streaming output mode

2020-05-20 Thread Shixiong(Ryan) Zhu
Hey Jungtaek, I totally agree with the issues of the complete mode that you raised here. However, not all streaming queries have unbounded state that quickly grows to an unmanageable size. Actually, I have found the complete mode pretty useful when the state is bounded and small. For example,
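
A minimal Scala sketch of that kind of bounded-state aggregation, assuming a Kafka source; the broker, topic, and console sink below are illustrative, not taken from the original thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("complete-mode-sketch").getOrCreate()
    import spark.implicits._

    // A stream whose grouping key has a small, bounded domain (e.g. a few event types).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS eventType")

    // At most one row per event type, so the state stays small and "complete"
    // mode can cheaply re-emit the whole result table on every trigger.
    val counts = events.groupBy($"eventType").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()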

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Shixiong(Ryan) Zhu
"spark.sql("set -v")" returns a Dataset that has all non-internal SQL configurations. Should be pretty easy to automatically generate a SQL configuration page. Best Regards, Ryan On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon wrote: > I think automatically creating a configuration page isn't a

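A rough Scala sketch of the idea above, dumping the output of spark.sql("SET -v") into a Markdown table; it assumes an existing SparkSession named `spark`, and the column order (key, value, meaning) and the output path are assumptions rather than details from the thread:

    // "SET -v" lists the non-internal SQL configs; assumed columns: key, value, meaning.
    val confs = spark.sql("SET -v").collect()

    val table = confs.map { row =>
      s"| ${row.getString(0)} | ${row.getString(1)} | ${row.getString(2)} |"
    }.mkString("| Property | Default | Meaning |\n|---|---|---|\n", "\n", "\n")

    // Write the generated page somewhere the docs build can pick it up (illustrative path).
    java.nio.file.Files.write(
      java.nio.file.Paths.get("sql-configuration.md"),
      table.getBytes("UTF-8"))
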
Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Shixiong(Ryan) Zhu
Should we also add a guideline for non-Scala tests? Other languages (Java, Python, R) don't support using an arbitrary string as a test name. Best Regards, Ryan On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon wrote: > I opened a PR - https://github.com/apache/spark-website/pull/231 > > Nov 13, 2019 (Wed) AM
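
For reference, the Scala-side convention under discussion looks roughly like this; the suite name and ticket number are made up:

    import org.scalatest.FunSuite

    class WidgetSuite extends FunSuite {
      // The JIRA ID prefix ties the test case back to the issue it covers.
      test("SPARK-12345: widget handles empty input") {
        assert(Seq.empty[Int].sum == 0)
      }
    }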

Re: Why two netty libs?

2019-09-03 Thread Shixiong(Ryan) Zhu
Yep, historical reasons. And Netty 4 is under another namespace, so we can use Netty 3 and Netty 4 in the same JVM. On Tue, Sep 3, 2019 at 6:15 AM Sean Owen wrote: > It was for historical reasons; some other transitive dependencies needed > it. > I actually was just able to exclude Netty 3 last

Re: [SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?

2019-08-26 Thread Shixiong(Ryan) Zhu
We were worried about regressions when adding Kafka source v2 because it had lots of changes. Hence we copy-pasted the code to keep Kafka source v1 untouched and provided a config to fall back to v1. On Mon, Aug 26, 2019 at 7:05 AM Jungtaek Lim wrote: > Thanks! The patch is here:

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-22 Thread Shixiong(Ryan) Zhu
+1. I have tested it and it looks good! Best Regards, Ryan On Sun, Apr 21, 2019 at 8:49 PM Wenchen Fan wrote: > Yea these should be mentioned in the 2.4.1 release notes. > > It seems we only have one ticket that is labeled as "release-notes" for > 2.4.2:

Re: Scala type checking thread-safety issue, and global locks to resolve it

2019-03-15 Thread Shixiong(Ryan) Zhu
Forgot to link the ticket that removed the global ScalaReflectionLock: https://issues.apache.org/jira/browse/SPARK-19810 Best Regards, Ryan On Fri, Mar 15, 2019 at 10:40 AM Shixiong(Ryan) Zhu wrote: > Hey Sean, > > Sounds good to me. At least, it's not worse than any versions prior t

Re: Scala type checking thread-safety issue, and global locks to resolve it

2019-03-15 Thread Shixiong(Ryan) Zhu
Hey Sean, Sounds good to me. At least it's not worse than any version prior to 2.3.0, which had a global ScalaReflectionLock. In addition, if someone hits a performance regression caused by this, they are probably creating too many Encoders; reusing Encoders is a better solution in that case.
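
A minimal Scala sketch of what reusing an Encoder looks like; the class, app name, and paths are illustrative:

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    case class Record(id: Long, name: String)

    val spark = SparkSession.builder().appName("encoder-reuse-sketch").getOrCreate()

    // Deriving an Encoder triggers Scala reflection, which is where the lock
    // contention shows up -- so derive it once...
    val recordEncoder: Encoder[Record] = Encoders.product[Record]

    // ...and reuse it for every Dataset instead of re-deriving it per call.
    def load(path: String) = spark.read.parquet(path).as(recordEncoder)

    val a = load("/data/records-a")
    val b = load("/data/records-b")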

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-04 Thread Shixiong(Ryan) Zhu
-1. Found an issue in a new 2.4 Java API: https://issues.apache.org/jira/browse/SPARK-25644 We should fix it in 2.4.0 to avoid future breaking changes. Best Regards, Ryan On Mon, Oct 1, 2018 at 7:22 PM Michael Heuer wrote: > FYI I’ve open two new issues against 2.4.0 rc2 > >

Re: Support SqlStreaming in spark

2018-06-27 Thread Shixiong(Ryan) Zhu
Structured Streaming supports the same standard SQL as batch queries, so users can easily switch their queries between batch and streaming. Could you clarify what problems SqlStreaming solves and what the benefits of the new syntax are? Best Regards, Ryan On Thu, Jun 14, 2018 at 7:06 PM, JackyLee
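
To illustrate the switching point, a small Scala sketch in which the same query body runs as a batch job and as a streaming job; the schema, paths, and console sink are illustrative:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("batch-vs-streaming-sketch").getOrCreate()

    val logSchema = new StructType()
      .add("level", StringType)
      .add("service", StringType)

    // The same transformation works on a batch or a streaming DataFrame.
    def errorCounts(logs: DataFrame): DataFrame =
      logs.filter(col("level") === "ERROR").groupBy(col("service")).count()

    // Batch: static source, action on the result.
    errorCounts(spark.read.schema(logSchema).json("/data/logs")).show()

    // Streaming: only the source and sink change; the query body is identical.
    errorCounts(spark.readStream.schema(logSchema).json("/data/logs"))
      .writeStream.outputMode("complete").format("console").start()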

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Shixiong(Ryan) Zhu

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-20 Thread Shixiong(Ryan) Zhu
I'm -1 because of the UI regression https://issues.apache.org/jira/browse/SPARK-23470: the All Jobs page may be too slow and cause "read timeout" when there are lots of jobs and stages. This is one of the most important pages, because when it's broken it's pretty hard to use the Spark Web UI. On

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Shixiong(Ryan) Zhu
+ Jose On Thu, Jan 25, 2018 at 2:18 PM, Dongjoon Hyun wrote: > SPARK-23221 is one of the reasons for Kafka-test-suite deadlock issue. > > For the hang issues, it seems not to be marked as a failure correctly in > Apache Spark Jenkins history. > > > On Thu, Jan 25, 2018

Re: Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-12 Thread Shixiong(Ryan) Zhu
FYI, we reverted a commit in https://github.com/apache/spark/commit/55dbfbca37ce4c05f83180777ba3d4fe2d96a02e to fix the issue. On Fri, Jan 12, 2018 at 11:45 AM, Xin Lu wrote: > seems like someone should investigate what caused the build time to go up > an hour and if it's

Re: [SQL] Why no numOutputRows metric for LocalTableScanExec in webUI?

2017-11-16 Thread Shixiong(Ryan) Zhu
SQL metrics are collected using SparkListener. If there are no tasks, org.apache.spark.sql.execution.ui.SQLListener cannot collect any metrics. On Thu, Nov 16, 2017 at 1:53 AM, Jacek Laskowski wrote: > Hi, > > I seem to have figured out why the metric is not in the web UI for

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Shixiong(Ryan) Zhu
+1 On Tue, Nov 7, 2017 at 1:34 PM, Joseph Bradley wrote: > +1 > > On Mon, Nov 6, 2017 at 5:11 PM, Michael Armbrust > wrote: > >> +1 >> >> On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li wrote: >> >>> +1 >>> >>> 2017-11-04 11:00

Re: [SS] Why does StreamingQueryManager.notifyQueryTermination use id and runId (not just id)?

2017-10-27 Thread Shixiong(Ryan) Zhu
stateStoreCoordinator uses runId to handle the small chance that Spark cannot shut a bad task down. Please see https://github.com/apache/spark/pull/18355 On Fri, Oct 27, 2017 at 3:40 AM, Jacek Laskowski wrote: > Hi, > > I'm wondering why

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-15 Thread Shixiong(Ryan) Zhu
Can we just create those tables once locally using official Spark versions and commit them? Then the unit tests can just read these files and won't need to download Spark. On Thu, Sep 14, 2017 at 8:13 AM, Sean Owen wrote: > I think the download could use the Apache mirror,

Re: SQLListener concurrency bug?

2017-06-26 Thread Shixiong(Ryan) Zhu
Right now they are safe because the caller also calls synchronized when using them. This is to avoid copying objects. It's probably a bad design. If you want to refactor them, PR is welcome. On Mon, Jun 26, 2017 at 2:27 AM, Oleksandr Vayda wrote: > Hi all, > > Reading

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread Shixiong(Ryan) Zhu
Hey Assaf, You need to change "v2.2.0" to "v2.2.0-rc5" in the GitHub links because there is no v2.2.0 yet. On Mon, Jun 26, 2017 at 12:57 AM, assaf.mendelson wrote: > Not a show stopper, however, I was looking at the structured streaming > programming guide and under

Re: structured streaming documentation does not match behavior

2017-06-16 Thread Shixiong(Ryan) Zhu
I created https://issues.apache.org/jira/browse/SPARK-21123. PR is welcome. On Thu, Jun 15, 2017 at 10:55 AM, Shixiong(Ryan) Zhu < shixi...@databricks.com> wrote: > Good catch. These are file source options. Could you submit a PR to fix > the doc? Thanks! > > On Thu, Jun 15

Re: structured streaming documentation does not match behavior

2017-06-15 Thread Shixiong(Ryan) Zhu
Good catch. These are file source options. Could you submit a PR to fix the doc? Thanks! On Thu, Jun 15, 2017 at 10:46 AM, Mendelson, Assaf wrote: > Hi, > > I have started to play around with structured streaming and it seems the > documentation (structured streaming

Re: Can I use ChannelTrafficShapingHandler to control the network read/write speed in shuffle?

2017-06-13 Thread Shixiong(Ryan) Zhu
I took a look at ChannelTrafficShapingHandler. Looks like it's because it doesn't support FileRegion. Spark's messages use this interface. See org.apache.spark.network.protocol.MessageWithHeader. On Tue, Jun 13, 2017 at 4:17 AM, Niu Zhaojie wrote: > Hi All: > > I am trying

Re: Question about upgrading Kafka client version

2017-03-10 Thread Shixiong(Ryan) Zhu
I did some investigation yesterday and just posted my findings in the ticket. Please read my latest comment in https://issues.apache.org/jira/browse/SPARK-18057 On Fri, Mar 10, 2017 at 11:41 AM, Cody Koeninger wrote: > There are existing tickets on the issues around kafka

Re: PSA: Java 8 unidoc build

2017-02-07 Thread Shixiong(Ryan) Zhu
@Sean, I'm using Java 8 but don't see these errors until I manually build the API docs. Hence I think dropping Java 7 support may not help. Right now we don't build docs in most builds because building docs takes a long time (e.g., https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/2889/

Re: Structured Streaming Source error

2017-01-31 Thread Shixiong(Ryan) Zhu
You used one Spark version to compile your code but a newer version to run it. Since the Source APIs are not stable, Spark doesn't guarantee that they are binary compatible. On Tue, Jan 31, 2017 at 1:39 PM, Sam Elamin wrote: > Hi Folks > > > I am getting a weird

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Shixiong(Ryan) Zhu
Congrats Burak & Holden! On Tue, Jan 24, 2017 at 10:39 AM, Joseph Bradley wrote: > Congratulations Burak & Holden! > > On Tue, Jan 24, 2017 at 10:33 AM, Dongjoon Hyun > wrote: > >> Great! Congratulations, Burak and Holden. >> >> Bests, >> Dongjoon.

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Could you post your code, please? On Wed, Jan 11, 2017 at 3:53 PM, Kalvin Chau <kalvinnc...@gmail.com> wrote: > "spark.speculation" is not set, so it would be whatever the default is. > > > On Wed, Jan 11, 2017 at 3:43 PM Shixiong(Ryan) Zhu < > shixi...@dat

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
be documented anywhere. > > Does the Kafka 0.10 require the number of cores on an executor be set to > 1? I didn't see that documented anywhere either. > > On Wed, Jan 11, 2017 at 3:27 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Do you change

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
of the worker threads. > > On Wed, Jan 11, 2017 at 2:53 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > > I think you may reuse the kafka DStream (the DStream returned by > createDirectStream). If you need to read from the same Kafka source, you > need to create

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
I think you may be reusing the Kafka DStream (the DStream returned by createDirectStream). If you need to read from the same Kafka source again, you need to create another DStream. On Wed, Jan 11, 2017 at 2:38 PM, Kalvin Chau wrote: > Hi, > > We've been running into
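
A rough Scala sketch of that suggestion using the spark-streaming-kafka-0-10 direct stream API; it assumes an existing StreamingContext `ssc`, the broker, topic, and group ids are illustrative, and giving each stream its own group.id is an extra assumption, not something stated above:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "host:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "windowing-example-1",
      "auto.offset.reset" -> "latest")

    // Instead of feeding one DStream into two independent pipelines, create a
    // separate direct stream for each consumer of the same topic.
    val streamA = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
    val streamB = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams + ("group.id" -> "windowing-example-2")))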

Re: Spark structured steaming from kafka - last message processed again after resume from checkpoint

2016-12-25 Thread Shixiong(Ryan) Zhu
Hi Niek, That's expected. Just answered on stackoverflow. On Sun, Dec 25, 2016 at 8:07 AM, Niek wrote: > Hi, > > I described my issue in full detail on > http://stackoverflow.com/questions/41300223/spark- >

Re: Kafka Spark structured streaming latency benchmark.

2016-12-19 Thread Shixiong(Ryan) Zhu
Hey Prashant. Thanks for your code. I did some investigation and it turned out that ContextCleaner is too slow and its "referenceQueue" keeps growing. My hunch is that cleaning broadcasts is very slow since it's a blocking call. On Mon, Dec 19, 2016 at 12:50 PM, Shixiong(Ryan) Z

Re: Kafka Spark structured streaming latency benchmark.

2016-12-19 Thread Shixiong(Ryan) Zhu
Hey, Prashant. Could you track the GC root of byte arrays in the heap? On Sat, Dec 17, 2016 at 10:04 PM, Prashant Sharma wrote: > Furthermore, I ran the same thing with 26 GB as the memory, which would > mean 1.3GB per thread of memory. My jmap >

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Shixiong(Ryan) Zhu
Sean, the "stress test for failOnDataLoss=false" failure is because the Kafka consumer may throw an NPE when a topic is deleted. I added some logic to retry on such failures; however, it may still fail when topic deletion is too frequent (the stress test). Just reopened

Re: Difference between netty and netty-all

2016-12-05 Thread Shixiong(Ryan) Zhu
No. I meant only updating master. It's not worth updating a maintenance branch unless there are critical issues. On Mon, Dec 5, 2016 at 5:39 PM, Nicholas Chammas <nicholas.cham...@gmail.com > wrote: > You mean just for branch-2.0, right? > > > On Mon, Dec 5, 2016 at 8:35 PM

Re: Difference between netty and netty-all

2016-12-05 Thread Shixiong(Ryan) Zhu
Hey Nick, It should be safe to upgrade Netty to the latest 4.0.x version. Could you submit a PR, please? On Mon, Dec 5, 2016 at 11:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > That file is in Netty 4.0.29, but I believe the PR I referenced is not. > It's only in Netty 4.0.37

Re: Can I add a new method to RDD class?

2016-12-05 Thread Shixiong(Ryan) Zhu
RDD.sparkContext is public: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@sparkContext:org.apache.spark.SparkContext On Mon, Dec 5, 2016 at 1:04 PM, Teng Long wrote: > Thank you for providing another answer, Holden. > > So I did what
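
Since this came up while adding a method to the RDD class, here is a minimal enrichment sketch that relies only on public API such as rdd.sparkContext; the object and method names are made up:

    import org.apache.spark.rdd.RDD

    object RDDExtensions {
      // An implicit class adds a "method" to RDD without modifying Spark itself.
      implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
        def describe(): String =
          s"RDD ${rdd.id} with ${rdd.getNumPartitions} partitions " +
            s"in app ${rdd.sparkContext.appName}"
      }
    }

    // Usage (assuming an existing SparkContext `sc`):
    //   import RDDExtensions._
    //   println(sc.parallelize(1 to 10).describe())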

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Shixiong(Ryan) Zhu
If you create a HiveContext before starting StreamingContext, then `SQLContext.getOrCreate` in foreachRDD will return the HiveContext you created. You can just call asInstanceOf[HiveContext] to convert it to HiveContext. On Tue, Nov 22, 2016 at 8:25 AM, Dirceu Semighini Filho <
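
A minimal Scala sketch of that pattern with the Spark 1.x-era API; `sc` and `dstream` are assumed to already exist:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    // Create the HiveContext before the StreamingContext is started...
    val hiveContext = new HiveContext(sc)

    dstream.foreachRDD { rdd =>
      // ...then getOrCreate returns that same instance inside foreachRDD,
      // and it can be cast back to HiveContext.
      val hc = SQLContext.getOrCreate(rdd.sparkContext).asInstanceOf[HiveContext]
      hc.sql("SHOW TABLES").show()
    }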

Re: Running lint-java during PR builds?

2016-11-15 Thread Shixiong(Ryan) Zhu
I remember it's because you need to run `mvn install` before running lint-java if the maven cache is empty, and `mvn install` is pretty heavy. On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin wrote: > Hey all, > > Is there a reason why lint-java is not run during PR builds?

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Shixiong(Ryan) Zhu
+1 On Tue, Nov 8, 2016 at 5:50 AM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > +1 (non-binding) > > over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3, > YARN, Hive > > > On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier < > hvanhov...@databricks.com>

Re: Spark has a compile dependency on scalatest

2016-10-28 Thread Shixiong(Ryan) Zhu
This is my test pom: modelVersion 4.0.0, project foo/bar, version 1.0, with a single dependency on org.apache.spark:spark-core_2.10:2.0.1. scalatest is in the compile scope: [INFO] bar:foo:jar:1.0 [INFO] \- org.apache.spark:spark-core_2.10:jar:2.0.1:compile [INFO]    +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:compile [INFO]

Re: Spark has a compile dependency on scalatest

2016-10-28 Thread Shixiong(Ryan) Zhu
You can just exclude scalatest from Spark. On Fri, Oct 28, 2016 at 12:51 PM, Jeremy Smith wrote: > spark-core depends on spark-launcher (compile) > spark-launcher depends on spark-tags (compile) > spark-tags depends on scalatest (compile) > > To be honest I'm not all
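
For example, a rough sbt sketch of that exclusion; sbt and the versions are assumptions, and with Maven you would add an <exclusion> block on the spark-core dependency instead:

    // build.sbt -- pull in spark-core but drop the transitive scalatest dependency.
    libraryDependencies += ("org.apache.spark" %% "spark-core" % "2.0.1")
      .excludeAll(ExclusionRule(organization = "org.scalatest"))

    // If you still need scalatest for your own tests, add it back in test scope only.
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % Test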

Re: Spark has a compile dependency on scalatest

2016-10-28 Thread Shixiong(Ryan) Zhu
spark-tags is in the compile scope of spark-core... On Fri, Oct 28, 2016 at 12:27 PM, Sean Owen wrote: > It's required because the tags module uses it to define annotations for > tests. I don't see it in compile scope for anything but the tags module, > which is then in test

Re: This Exception has been really hard to trace

2016-10-10 Thread Shixiong(Ryan) Zhu
It seems the Spark version at runtime is different from the one you compiled against. You should mark the Spark components as "provided". See https://issues.apache.org/jira/browse/SPARK-9219 On Sun, Oct 9, 2016 at 8:13 PM, kant kodali wrote: > > I tried SpanBy but look like there is a strange error
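
A minimal sbt sketch of what "provided" looks like; sbt is an assumption (with Maven, set <scope>provided</scope>), and the versions are illustrative and should match the cluster:

    // build.sbt -- the cluster supplies the Spark jars at runtime, so compile
    // against them but don't bundle them into the application jar.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided")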

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Shixiong(Ryan) Zhu
Congrats! On Tue, Oct 4, 2016 at 9:09 AM, Yanbo Liang wrote: > Congrats and welcome! > > On Tue, Oct 4, 2016 at 9:01 AM, Herman van Hövell tot Westerflier < > hvanhov...@databricks.com> wrote: > >> Congratulations Xiao! Very well deserved! >> >> On Mon, Oct 3, 2016 at 10:46

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-30 Thread Shixiong(Ryan) Zhu
Hey Mark, I can reproduce the failure locally using your command. There were a lot of OutOfMemoryError in the unit test log. I increased the heap size from 3g to 4g at https://github.com/apache/spark/blob/v2.0.1-rc4/pom.xml#L2029 and it passed tests. I think the patch you mentioned increased the

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Shixiong(Ryan) Zhu
+1 On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee wrote: > +1 > > > On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier < > hvanhov...@databricks.com> wrote: > >> +1 (non-binding) >> >> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida < >>

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Shixiong(Ryan) Zhu
r call. > > Cheers, > > On Tue, Jun 21, 2016 at 6:40 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Hey Pete, >> >> I didn't backport it to 1.6 because it just affects tests in most cases. >> I'm sure we also have other places calling

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Shixiong(Ryan) Zhu
Hey Pete, I didn't backport it to 1.6 because it just affects tests in most cases. I'm sure we also have other places calling blocking methods in the event loops, so similar issues are still there even after applying this patch. Hence, I don't think it's a blocker for 1.6.2. On Tue, Jun 21, 2016

Re: Welcoming Yanbo Liang as a committer

2016-06-05 Thread Shixiong(Ryan) Zhu
Congrats, Yanbo! On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin wrote: > Congratulations Yanbo! > > On Mon, Jun 6, 2016 at 7:07 AM, Bryan Cutler wrote: > >> Congratulations Yanbo! >> On Jun 5, 2016 4:03 AM, "Kousuke Saruta" >> wrote:

Re: LiveListenerBus with started and stopped flags? Why both?

2016-05-26 Thread Shixiong(Ryan) Zhu
Just to prevent LiveListenerBus from being restarted. The internal thread cannot be restarted. On Wed, May 25, 2016 at 12:59 PM, Jacek Laskowski wrote: > Hi, > > I'm wondering why LiveListenerBus has two AtomicBoolean flags [1]? > Could it not have just one, say started? Why does

Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-07 Thread Shixiong(Ryan) Zhu
There is a fix: https://github.com/apache/spark/pull/11567 On Mon, Mar 7, 2016 at 11:39 PM, Reynold Xin wrote: > +Sean, who was playing with this. > > > > > On Mon, Mar 7, 2016 at 11:38 PM, Jacek Laskowski wrote: > >> Hi, >> >> Got the BUILD FAILURE.

Re: PySpark, spill-related (possibly psutil) issue, throwing an exception '_fill_function() takes exactly 4 arguments (5 given)'

2016-03-06 Thread Shixiong(Ryan) Zhu
Could you rebuild the whole project? I changed the Python function serialization format in https://github.com/apache/spark/pull/11535 to fix a bug. This exception looks like some place is still using the old code. On Sun, Mar 6, 2016 at 6:24 PM, Hyukjin Kwon wrote: > Just

Re: getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Shixiong(Ryan) Zhu
You can take a look at "org.apache.spark.streaming.scheduler.ReceiverTracker#getExecutors" On Thu, Mar 3, 2016 at 3:10 PM, Reynold Xin wrote: > What do you mean by consistent? Throughout the life cycle of an app, the > executors can come and go and as a result really has no

Re: Welcoming two new committers

2016-02-08 Thread Shixiong(Ryan) Zhu
Congrats!!! Herman and Wenchen!!! On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende wrote: > > > On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia > wrote: > >> Hi all, >> >> The PMC has recently added two new Spark committers -- Herman van Hovell >>

Re: Data not getting printed in Spark Streaming with print().

2016-01-28 Thread Shixiong(Ryan) Zhu
fileStream has a parameter "newFilesOnly". By default it's true, which means only new files are processed and existing files in the directory are ignored. So you need to ***move*** files into the directory; otherwise existing files will be ignored. You can also set "newFilesOnly" to false. Then in the
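
A rough Scala sketch of the newFilesOnly = false variant; the directory, batch interval, and the existing SparkContext `sc` are assumptions:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // Like textFileStream, but with newFilesOnly = false so files already sitting
    // in the directory when the stream starts are processed too.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "/data/incoming", (_: Path) => true, newFilesOnly = false)
      .map { case (_, text) => text.toString }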

Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-29 Thread Shixiong(Ryan) Zhu
Hi Jan, could you post your code? I could not reproduce this issue in my environment. Best Regards, Shixiong Zhu 2015-12-29 10:22 GMT-08:00 Shixiong Zhu : > Could you create a JIRA? We can continue the discussion there. Thanks! > > Best Regards, > Shixiong Zhu > >