Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-22 Thread angers zhu
Hi, it seems that

   - [SPARK-35391]:
   Memory leak in ExecutorAllocationListener breaks dynamic allocation under
   high load

links to the wrong JIRA ticket?

Mich Talebzadeh wrote on Tue, Feb 22, 2022 at 15:49:

> Well, that is pretty easy to do.
>
> However, a quick fix for now could be to retag the created image. The volume
> is small, so it can be done manually for now. For example, I just
> downloaded v3.1.3
>
>
> docker image ls
>
> REPOSITORY     TAG      IMAGE ID       CREATED        SIZE
> apache/spark   v3.1.3   31ed15daa2bf   12 hours ago   531MB
>
> Retag it with
>
>
> docker tag 31ed15daa2bf apache/spark/tags/spark-3.1.3-scala_2.12-8-jre-slim-buster
>
> docker image ls
>
> REPOSITORY                                                    TAG      IMAGE ID       CREATED        SIZE
> apache/spark/tags/spark-3.1.3-scala_2.12-8-jre-slim-buster    latest   31ed15daa2bf   12 hours ago   531MB
>
> Then push it with (example)
>
> docker push apache/spark/tags/spark-3.1.3-scala_2.12-8-jre-slim-buster
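For comparison, a minimal sketch of the same retag that keeps a single repository and encodes the details in the tag; the tag string here is illustrative, not an image that was actually published:

docker tag 31ed15daa2bf apache/spark:3.1.3-scala_2.12-8-jre-slim-buster
docker push apache/spark:3.1.3-scala_2.12-8-jre-slim-buster

This avoids creating a new repository path per build and is close to the naming convention discussed later in the thread.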
>
>
> HTH
>
>
>
>
>
>
> On Mon, 21 Feb 2022 at 23:51, Holden Karau  wrote:
>
>> Yeah, I think we should still adopt that naming convention; however, no one
>> has taken the time to write a script to do it yet, so until we get that
>> script merged I think we'll just have one build. I can try to do that for
>> the next release, but it would be a great second issue for someone getting more
>> familiar with the release tooling.
>>
>> On Mon, Feb 21, 2022 at 2:18 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Ok thanks for the correction.
>>>
>>> The docker pull line shows as follows:
>>>
>>> docker pull apache/spark:v3.2.1
>>>
>>>
>>> So this only tells me the Spark version, 3.2.1.
>>>
>>>
>>> I thought we discussed the docker naming conventions in detail and broadly
>>> agreed on what needs to be in the naming convention. For example, in the
>>> thread "Time to start publishing Spark Docker Images?" dated 22nd July 2021.
>>>
>>>
>>> Referring to that, I think the broad agreement was that the docker image
>>> name should be of the form:
>>>
>>>
>>> The image name provides:
>>>
>>>    - What it is built for: spark, spark-py (PySpark), or spark-r
>>>    - Spark version: 3.1.1, 3.1.2, 3.2.1, etc.
>>>    - Scala version: 2.12
>>>    - The OS/Java base image: 8-jre-slim-buster or
>>>    11-jre-slim-buster, meaning Java 8 and Java 11 respectively
>>>
>>> I believe it is a good thing and we ought to adopt that convention. For
>>> example:
>>>
>>>
>>> spark-py-3.2.1-scala_2.12-11-jre-slim-buster
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 21 Feb 2022 at 21:58, Holden Karau  wrote:
>>>
 My bad, the correct link is:

 https://hub.docker.com/r/apache/spark/tags

 On Mon, Feb 21, 2022 at 1:17 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Well, that docker link is not found! Maybe it's a permission issue.
>
>
>
>
>
>
>
>
> On Mon, 21 Feb 2022 at 21:09, Holden Karau 
> wrote:
>
>> We are happy 

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-27 Thread angers zhu
Thanks for your great work! Shane.

Best Regards
Angerszh

Yikun Jiang wrote on Tue, Dec 28, 2021 at 10:29:

> Thanks for your work, Shane!
>
> Regards,
> Yikun
>
>
> shane knapp ☠ wrote on Tue, Dec 28, 2021 at 06:48:
>
>> # systemctl stop jenkins
>> #
>>
>> 
>>
>> goodbye jenkins!  
>>
>> On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠ 
>> wrote:
>>
>>> hey everyone!
>>>
>>> after a marathon run of nearly a decade, we're finally going to be
>>> shutting down {amp|rise}lab jenkins at the end of this month...
>>>
>>> the earliest snapshot i could find is from 2013 with builds for spark
>>> 0.7:
>>>
>>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>>
>>> it's been a hell of a run, and i'm gonna miss randomly tweaking the
>>> build system, but technology has moved on and running a dedicated set of
>>> servers for just one open source project is just too expensive for us here
>>> at uc berkeley.
>>>
>>> if there's interest, i'll fire up a zoom session and all y'alls can
>>> watch me type the final command:
>>>
>>> systemctl stop jenkins
>>>
>>> feeling bittersweet,
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Hadoop profile change to hadoop-2 and hadoop-3 since Spark 3.3

2021-12-08 Thread angers zhu
Hi all,

Since Spark 3.2, we have supported Hadoop 3.3.1, but the profile names are still
*hadoop-3.2* (and *hadoop-2.7*), which is no longer accurate.
So we made a change in https://github.com/apache/spark/pull/34715.
Starting from Spark 3.3, we use the hadoop profiles *hadoop-2* and *hadoop-3*,
and the default hadoop profile is hadoop-3.
Profile changes

*hadoop-2.7* changed to *hadoop-2*
*hadoop-3.2* changed to *hadoop-3*
Release tar file

Spark-3.3.0 with profile hadoop-3: *spark-3.3.0-bin-hadoop3.tgz*
Spark-3.3.0 with profile hadoop-2: *spark-3.3.0-bin-hadoop2.tgz*

For Spark 3.2.0, the release tar file was, for example,
*spark-3.2.0-bin-hadoop3.2.tgz*.
Pip install option changes

For PySpark with or without a specific Hadoop version, you can install it
using the PYSPARK_HADOOP_VERSION environment variable, as below (Hadoop 3):

PYSPARK_HADOOP_VERSION=3 pip install pyspark

For Hadoop 2:

PYSPARK_HADOOP_VERSION=2 pip install pyspark

Supported values in PYSPARK_HADOOP_VERSION are now:

   - without: Spark pre-built with user-provided Apache Hadoop
   - 2: Spark pre-built for Apache Hadoop 2.
   - 3: Spark pre-built for Apache Hadoop 3.3 and later (default)
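Following the same pattern, for the "without" value listed above (a sketch based on those supported values):

PYSPARK_HADOOP_VERSION=without pip install pyspark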

Building Spark and Specifying the Hadoop Version


You can specify the exact version of Hadoop to compile against through the
hadoop.version property.
Example:

./build/mvn -Pyarn -Dhadoop.version=3.3.0 -DskipTests clean package

or you can specify *hadoop-3* profile

./build/mvn -Pyarn -Phadoop-3 -Dhadoop.version=3.3.0 -DskipTests clean package

If you want to build with Hadoop 2.x, enable *hadoop-2* profile:

./build/mvn -Phadoop-2 -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package
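For SBT users, the equivalent invocations should look roughly like the following (a sketch, assuming build/sbt accepts the same profile flags and hadoop.version property as the Maven build):

./build/sbt -Pyarn -Phadoop-3 -Dhadoop.version=3.3.0 package
./build/sbt -Pyarn -Phadoop-2 -Dhadoop.version=2.8.5 package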

Notes

On the current master, the build will use the default Hadoop 3 if you continue to
use -Phadoop-2.7 or -Phadoop-3.2 to build Spark,
because Maven and SBT just warn about and then ignore these non-existent profiles.
Please change your profiles to -Phadoop-2 or -Phadoop-3.


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread angers zhu
+1 on this,

Wenchen Fan wrote on Tue, Oct 26, 2021 at 10:29 PM:

> +1 to this SPIP and nice writeup of the design doc!
>
> Can we open comment permission in the doc so that we can discuss details
> there?
>
> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon  wrote:
>
>> Seems making sense to me.
>>
>> Would be great to have some feedback from people such as @Wenchen Fan
>>  @Cheng Su  @angers zhu
>> .
>>
>>
>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
>> wrote:
>>
>>> +1 for this SPIP.
>>>
>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao 
>>> wrote:
>>>
>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>> making this more generalized.
>>>>
>>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>>>
>>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>>> point!
>>>>>
>>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>>>
>>>>>> +1 on this SPIP.
>>>>>>
>>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>>> joins which can eliminate very expensive data shuffles when joins, and
>>>>>> many users in the Apache Spark community have wanted this feature for
>>>>>> a long time!
>>>>>>
>>>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>>>> it as a new feature in Spark 3.3
>>>>>>
>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>
>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>> storage partitioned join which covers bucket join support for 
>>>>>> DataSourceV2
>>>>>> but is more general. The goal is to let Spark leverage distribution
>>>>>> properties reported by data sources and eliminate shuffle whenever 
>>>>>> possible.
>>>>>> >
>>>>>> > Design doc:
>>>>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>> (includes a POC link at the end)
>>>>>> >
>>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>>> welcome!
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Chao
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>


Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-09 Thread angers zhu
+1 (non-binding)

Cheng Pan wrote on Sat, Oct 9, 2021 at 2:06 PM:

> +1 (non-binding)
>
> Integration test passed[1] with my project[2].
>
> [1]
> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017
> [2] https://github.com/housepower/spark-clickhouse-connector
>
> Thanks,
> Cheng Pan
>
>
> On Sat, Oct 9, 2021 at 2:01 PM Ye Zhou  wrote:
>
>> +1 (non-binding).
>>
>> Ran a Maven build and tested within our YARN cluster, in client and cluster
>> modes, with push-based shuffle enabled/disabled, and shuffling a large
>> amount of data. Applications ran successfully with the expected shuffle
>> behavior.
>>
>> On Fri, Oct 8, 2021 at 10:06 PM sarutak  wrote:
>>
>>> +1
>>>
>>> I think no critical issue left.
>>> Thank you Gengliang.
>>>
>>> Kousuke
>>>
>>> > +1
>>> >
>>> > Looks good.
>>> >
>>> > Liang-Chi
>>> >
>>> > On 2021/10/08 16:16:12, Kent Yao  wrote:
>>> >> +1 (non-binding) BR

Discuss about current yarn client mode problem

2021-08-31 Thread angers zhu
Hi devs,

In the current yarn-client mode, we have several problems (a small sketch follows the list):

   1. When the AM loses its connection with the driver, it just finishes the
   application with a final status of SUCCESS. YarnClientSchedulerBackend.MonitorThread
   then gets an application report with a SUCCESS final status and calls sc.stop().
   The SparkContext is stopped and the program exits with exit code 0. Scheduler
   systems generally use the exit code to judge whether an application succeeded,
   so neither the scheduler system nor the user knows that the job failed.
   2. In YarnClientSchedulerBackend.MonitorThread, even when it gets a YARN
   report with a FAILED or KILLED final status, it just calls sc.stop(), making the
   program exit with code 0. When a user kills the wrong application, the real
   owner of the killed application still gets a wrong SUCCESS status for their job.
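For illustration (the wrapper script and job name are hypothetical), a scheduler that relies only on the exit code cannot tell these cases apart:

# hypothetical scheduler wrapper around a yarn-client job
spark-submit --master yarn --deploy-mode client my_job.py
if [ $? -eq 0 ]; then
  echo "mark job SUCCESS"    # also reached when the AM lost the driver or the app was killed
else
  echo "mark job FAILED"
fi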

There is some historical discussion of these two problems in SPARK-3627
and SPARK-1516, but those were the results
of very early discussions. Spark is now widely used by various companies,
and a lot of Spark-related job scheduling systems have been developed
accordingly. These problems confuse users and make it hard for them to manage
their jobs.

I hope to get more feedback from the developers, or to hear whether there is a
good way to avoid these problems.

Below is my related PR about these two problems:
https://github.com/apache/spark/pull/33780

Best regards
Angers


Re: java.lang.ClassNotFoundException for custom hive authentication

2021-06-22 Thread angers zhu
Which version?


Jason Jun wrote on Tue, Jun 22, 2021 at 4:19 PM:

> Hi there,
>
> I'm leveraging the Thrift server to provide a SQL service, and using custom Hive
> authentication configured in hive-site.xml:
> --
> <property>
>   <name>hive.server2.custom.authentication.class</name>
>   <value>com.abc.ABCAuthenticationProvider</value>
> </property>
>
>
> I got this error when logging into the Thrift server. The class path was set
> using the --jars option.
> I guess this is because my class is loaded by the system class loader.
>
> Please let me know how to fix this.
> TIA
>
> -
> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
> com.abc.ABCAuthenticationProvider not found
>
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
>
> at
> org.apache.hive.service.auth.CustomAuthenticationProviderImpl.(CustomAuthenticationProviderImpl.java:39)
>
> at
> org.apache.hive.service.auth.AuthenticationProviderFactory.getAuthenticationProvider(AuthenticationProviderFactory.java:64)
>
> at
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
>
> at
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>
> at
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:537)
>
> at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
>
> at
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:43)
>
> at
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:223)
>
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:293)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
> at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.ClassNotFoundException: Class
> com.abc.ABCAuthenticationProvider not found
>
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
>
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
>
> ... 12 more
>
>
>
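A common workaround for this kind of issue (a sketch; the jar path is a placeholder, and it assumes the authenticator class must be visible to the Thrift server driver's system class loader rather than only to --jars) is to put the jar on the driver class path when starting the Thrift server:

sbin/start-thriftserver.sh \
  --driver-class-path /path/to/abc-auth.jar \
  --jars /path/to/abc-auth.jar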


Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-31 Thread angers zhu
Hi all,

Do you mean something like this:
https://github.com/apache/spark/pull/27248/files?
If you need it, I can raise a PR to add a SizeBasedCoalescer.

mhawes wrote on Tue, Mar 30, 2021 at 9:06 PM:

> Hi Pol, I had considered repartitioning but the main issue for me there is
> that it will trigger a shuffle and could significantly slow down the
> query/application as a result. Thanks for contributing that as an
> alternative suggestion though :)
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread angers zhu
Congratulations

Prashant Sharma wrote on Sat, Mar 27, 2021 at 8:35 AM:

> Congratulations  all!!
>
> On Sat, Mar 27, 2021, 5:10 AM huaxin gao  wrote:
>
>> Congratulations to you all!!
>>
>> On Fri, Mar 26, 2021 at 4:22 PM Yuming Wang  wrote:
>>
>>> Congrats!
>>>
>>> On Sat, Mar 27, 2021 at 7:13 AM Takeshi Yamamuro 
>>> wrote:
>>>
 Congrats, all~

 On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Congrats all!
>
> Liang-Chi Hsieh wrote on Sat, Mar 27, 2021 at 6:56 AM:
>
>> Congrats! Welcome!
>>
>>
>> Matei Zaharia wrote
>> > Hi all,
>> >
>> > The Spark PMC recently voted to add several new committers. Please
>> join me
>> > in welcoming them to their new role! Our new committers are:
>> >
>> > - Maciej Szymkiewicz (contributor to PySpark)
>> > - Max Gekk (contributor to Spark SQL)
>> > - Kent Yao (contributor to Spark SQL)
>> > - Attila Zsolt Piros (contributor to decommissioning and Spark on
>> > Kubernetes)
>> > - Yi Wu (contributor to Spark Core and SQL)
>> > - Gabor Somogyi (contributor to Streaming and security)
>> >
>> > All six of them contributed to Spark 3.1 and we’re very excited to
>> have
>> > them join as committers.
>> >
>> > Matei and the Spark PMC
>> >
>> -
>> > To unsubscribe e-mail:
>>
>> > dev-unsubscribe@.apache
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

 --
 ---
 Takeshi Yamamuro

>>>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread angers zhu
Great work, Hyukjin !

Bests,
Angers

Wenchen Fan wrote on Wed, Mar 3, 2021 at 5:02 PM:

> Great work and congrats!
>
> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>
>> Congrats, all!
>>
>> Bests,
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>>
>>
>>
>> On 03/3/2021 15:11,Takeshi Yamamuro
>>  wrote:
>>
>> Great work and Congrats, all!
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Thanks Hyukjin and congratulations everyone on the release !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>>
 Great work, Hyukjin!

 On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
 wrote:

> We are excited to announce Spark 3.1.1 today.
>
> Apache Spark 3.1.1 is the second release of the 3.x line. This release
> adds
> Python type annotations and Python dependency management support as
> part of Project Zen.
> Other major updates include improved ANSI SQL compliance support,
> history server support
> in structured streaming, the general availability (GA) of Kubernetes
> and node decommissioning
> in Kubernetes and Standalone. In addition, this release continues to
> focus on usability, stability,
> and polish while resolving around 1500 tickets.
>
> We'd like to thank our contributors and users for their contributions
> and early feedback to
> this release. This release would not have been possible without you.
>
> To download Spark 3.1.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-1.html
>
>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>


Change Restful API's default quantiles to match the Web UI

2021-01-10 Thread angers . zhu







Hi devs,

These days I found that, for taskSummary, the Restful API
(/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskSummary)
uses default quantiles of `0.05,0.25,0.5,0.75,0.95`, but on the Spark Web UI's stage page
the quantiles are `0.0,0.25,0.5,0.75,1.0`. The Restful API's default value is also not shown
on the Spark website's [monitoring page](https://spark.apache.org/docs/3.0.0/monitoring.html).
Details can be seen in https://github.com/apache/spark/pull/31048#issuecomment-756773764.
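For reference, callers who want the Web UI's quantiles today can pass them explicitly on the existing endpoint (a sketch; host, port and IDs are placeholders):

curl "http://<history-server>:18080/api/v1/applications/<app-id>/stages/<stage-id>/<stage-attempt-id>/taskSummary?quantiles=0.0,0.25,0.5,0.75,1.0"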




I have created a PR to keep them consistent, and I hope for your visibility and comments.

Best regards,
Angers

 






Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-04 Thread angers . zhu







+1

 


On 11/5/2020 07:04,Mridul Muralidharan wrote: 


+1

Regards,
Mridul

On Wed, Nov 4, 2020 at 12:41 PM Xinyi Yu wrote:

Hi all,

We had the discussion of SPIP: Standardize Spark Exception Messages at
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Standardize-Spark-Exception-Messages-td30341.html.
The SPIP document link is at
https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing.
We want to have the vote on this, for 72 hours.

Please vote before November 7th at noon:

[ ] +1: Accept this SPIP proposal
[ ] -1: Do not agree to standardize Spark exception messages, because ...


Thanks for your time and feedback!

--
Xinyi



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread angers . zhu






+1

 


On 09/15/2020 08:21,Xiao Li wrote: 


+1
Xiao

DB Tsai wrote on Mon, Sep 14, 2020 at 4:09 PM:
+1

On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh wrote:
+1
Chandni

On Mon, Sep 14, 2020 at 11:41 AM Tom Graves wrote:
+1
Tom





On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan  wrote:



Hi,

I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based shuffle to improve shuffle efficiency.

Please take a look at:
SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
SPIP doc: https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
POC against master and results summary: https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit

Active discussions on the jira and SPIP document have settled.
I will leave the vote open until Friday (the 18th September 2020), 5pm CST.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks,
Mridul


--
Sincerely,
DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1






Re: Welcoming some new Apache Spark committers

2020-07-15 Thread angers . zhu






Congratulations!

 


On 07/15/2020 14:53,Wenchen Fan wrote: 


Congrats and welcome!

On Wed, Jul 15, 2020 at 2:18 PM Mridul Muralidharan wrote:
Congratulations!
Regards,
Mridul

On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia wrote:
Hi all,

The Spark PMC recently voted to add several new committers. Please join me in welcoming them to their new roles! The new committers are:

- Huaxin Gao
- Jungtaek Lim
- Dilip Biswal

All three of them contributed to Spark 3.0 and we’re excited to have them join the project.

Matei and the Spark PMC
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org








Re: Spark Thrift Server java vm problem need help

2020-03-23 Thread angers . zhu






If -Xmx is bigger than 32g, the VM will not use UseCompressedOops by default. Consider this case: we set spark.driver.memory to 64g, set -XX:+UseCompressedOops in spark.executor.extraJavaOptions, and set SPARK_DAEMON_MEMORY=6g. With the current code, the VM gets a command line with -Xmx6g and -XX:+UseCompressedOops, so it runs with -XX:+UseCompressedOops and uses compressed oops. But since we set spark.driver.memory=64g, our JVM's max heap size will be 64g, yet we would still be using compressed oops. Wouldn't that be a problem?
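A quick way to check what the running Thrift server JVM actually ended up with (a sketch; the pid is a placeholder) is:

jdk8/bin/jinfo -flag MaxHeapSize <pid>
jdk8/bin/jinfo -flag UseCompressedOops <pid>

and, under the assumption that the daemon heap should match the intended driver heap, SPARK_DAEMON_MEMORY can be set in conf/spark-env.sh before starting the Thrift server:

export SPARK_DAEMON_MEMORY=64g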






  











 


On 03/23/2020 22:32, Sean Owen wrote:

I'm still not sure if you are trying to enable it or disable it, and what the issue is?
There is no logic in Spark that sets or disables this flag that I can see.

On Mon, Mar 23, 2020 at 9:27 AM angers.zhu wrote:







Hi Sean,

Yes, I set -XX:+UseCompressedOops in the driver (you can see it in the command line). These days we have more users, so I set spark.driver.memory to 64g; in the Non-default VM flags it should then be -XX:-UseCompressedOops, but it is still -XX:+UseCompressedOops. I have found the reason: SparkSubmitCommandBuilder.buildSparkSubmitCommand has logic like below:

if (isClientMode) {
  // Figuring out where the memory value come from is a little tricky due to precedence.
  // Precedence is observed in the following order:
  // - explicit configuration (setConf()), which also covers --driver-memory cli argument.
  // - properties file.
  // - SPARK_DRIVER_MEMORY env variable
  // - SPARK_MEM env variable
  // - default value (1g)
  // Take Thrift Server as daemon
  String tsMemory =
    isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
  String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
    System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
  cmd.add("-Xmx" + memory);
  addOptionString(cmd, driverDefaultJavaOptions);
  addOptionString(cmd, driverExtraJavaOptions);
  mergeEnvPathList(env, getLibPathEnvName(),
    config.get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));
}

For the Spark Thrift Server, SPARK_DAEMON_MEMORY is used first, which is really reasonable. But I am confused: if spark.driver.memory is bigger than 32g and SPARK_DAEMON_MEMORY is less than 32g, UseCompressedOops will still be enabled, right? Do we need to modify this logic for the > 32g case?

By the way, I hit a problem like https://issues.apache.org/jira/browse/SPARK-27097, caused by this strange case.

Thanks

 


On 03/23/2020 21:43, Sean Owen wrote:

I don't think Spark sets UseCompressedOops in any defaults; are you setting it?
It can't be used with heaps >= 32GB. It doesn't seem to cause an error if you set it with large heaps, just a warning.
What's the problem?

On Mon, Mar 23, 2020 at 6:21 AM angers.zhu wrote:








Hi developers,

These days I've met a strange problem and I can't figure out why. When I start a Spark Thrift Server with spark.driver.memory 64g and then use jdk8/bin/jinfo pid to look at the VM flags, I get the information below. In a 64g VM, UseCompressedOops should be off by default, so why does the Spark Thrift Server show -XX:+UseCompressedOops?

Non-default VM flags: -XX:CICompilerCount=15 -XX:-CMSClassUnloadingEnabled -XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:-ClassUnloading -XX:+DisableExplicitGC -XX:ErrorFile=null -XX:-ExplicitGCInvokesConcurrentAndUnloadsClasses -XX:InitialHeapSize=2116026368 -XX:+ManagementServer -XX:MaxDirectMemorySize=8589934592 -XX:MaxHeapSize=6442450944 -XX:MaxNewSize=2147483648 -XX:MaxTenuringThreshold=6 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=705298432 -XX:OldPLABSize=16 -XX:OldSize=1410727936 -XX:+PrintGC -XX:+PrintGCDateStamps 

Re: Spark Thrift Server java vm problem need help

2020-03-23 Thread angers . zhu






Hi Sean,

Yes, I set -XX:+UseCompressedOops in the driver (you can see it in the command line). These days we have more users, so I set spark.driver.memory to 64g; in the Non-default VM flags it should then be -XX:-UseCompressedOops, but it is still -XX:+UseCompressedOops. I have found the reason: SparkSubmitCommandBuilder.buildSparkSubmitCommand has logic like below:

if (isClientMode) {
  // Figuring out where the memory value come from is a little tricky due to precedence.
  // Precedence is observed in the following order:
  // - explicit configuration (setConf()), which also covers --driver-memory cli argument.
  // - properties file.
  // - SPARK_DRIVER_MEMORY env variable
  // - SPARK_MEM env variable
  // - default value (1g)
  // Take Thrift Server as daemon
  String tsMemory =
    isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
  String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
    System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
  cmd.add("-Xmx" + memory);
  addOptionString(cmd, driverDefaultJavaOptions);
  addOptionString(cmd, driverExtraJavaOptions);
  mergeEnvPathList(env, getLibPathEnvName(),
    config.get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));
}

For the Spark Thrift Server, SPARK_DAEMON_MEMORY is used first, which is really reasonable. But I am confused: if spark.driver.memory is bigger than 32g and SPARK_DAEMON_MEMORY is less than 32g, UseCompressedOops will still be enabled, right? Do we need to modify this logic for the > 32g case?

By the way, I hit a problem like https://issues.apache.org/jira/browse/SPARK-27097, caused by this strange case.

Thanks

 


On 03/23/2020 21:43, Sean Owen wrote:

I don't think Spark sets UseCompressedOops in any defaults; are you setting it?
It can't be used with heaps >= 32GB. It doesn't seem to cause an error if you set it with large heaps, just a warning.
What's the problem?

On Mon, Mar 23, 2020 at 6:21 AM angers.zhu wrote:








Hi developers,

These days I've met a strange problem and I can't figure out why. When I start a Spark Thrift Server with spark.driver.memory 64g and then use jdk8/bin/jinfo pid to look at the VM flags, I get the information below. In a 64g VM, UseCompressedOops should be off by default, so why does the Spark Thrift Server show -XX:+UseCompressedOops?

Non-default VM flags: -XX:CICompilerCount=15 -XX:-CMSClassUnloadingEnabled -XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:-ClassUnloading -XX:+DisableExplicitGC -XX:ErrorFile=null -XX:-ExplicitGCInvokesConcurrentAndUnloadsClasses -XX:InitialHeapSize=2116026368 -XX:+ManagementServer -XX:MaxDirectMemorySize=8589934592 -XX:MaxHeapSize=6442450944 -XX:MaxNewSize=2147483648 -XX:MaxTenuringThreshold=6 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=705298432 -XX:OldPLABSize=16 -XX:OldSize=1410727936 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:-TraceClassUnloading -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseFastUnorderedTimeStamps -XX:+UseParNewGC

Command line: -Xmx6g -Djava.library.path=/home/hadoop/hadoop/lib/native -Djavax.security.auth.useSubjectCredsOnly=false -Dcom.sun.management.jmxremote.port=9021 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:MaxPermSize=1024m -XX:PermSize=256m -XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Since I am not an expert in the VM, I hope for some help.

Spark Thrift Server java vm problem need help

2020-03-23 Thread angers . zhu







Hi developers,

These days I've met a strange problem and I can't figure out why. When I start a Spark Thrift Server with spark.driver.memory 64g and then use jdk8/bin/jinfo pid to look at the VM flags, I get the information below. In a 64g VM, UseCompressedOops should be off by default, so why does the Spark Thrift Server show -XX:+UseCompressedOops?

Non-default VM flags: -XX:CICompilerCount=15 -XX:-CMSClassUnloadingEnabled -XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:-ClassUnloading -XX:+DisableExplicitGC -XX:ErrorFile=null -XX:-ExplicitGCInvokesConcurrentAndUnloadsClasses -XX:InitialHeapSize=2116026368 -XX:+ManagementServer -XX:MaxDirectMemorySize=8589934592 -XX:MaxHeapSize=6442450944 -XX:MaxNewSize=2147483648 -XX:MaxTenuringThreshold=6 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=705298432 -XX:OldPLABSize=16 -XX:OldSize=1410727936 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:-TraceClassUnloading -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseFastUnorderedTimeStamps -XX:+UseParNewGC

Command line: -Xmx6g -Djava.library.path=/home/hadoop/hadoop/lib/native -Djavax.security.auth.useSubjectCredsOnly=false -Dcom.sun.management.jmxremote.port=9021 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:MaxPermSize=1024m -XX:PermSize=256m -XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Since I am not an expert in the VM, I hope for some help.

 






Re: Fw: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2020-01-01 Thread angers . zhu







Hi all,

Below is our working repo:
https://github.com/spark-thriftserver/spark-thriftserver




Hope for your good suggestions.

  











 


On 12/30/2019 14:38, Sandeep Katta wrote:

+1

On Mon, 30 Dec 2019 at 10:24, Gengliang wrote:
+1

On Sun, Dec 29, 2019 at 8:33 PM Wenchen Fan wrote:
+1 for the new thrift server to get rid of the Hive dependencies!

On Mon, Dec 23, 2019 at 7:55 PM Yuming Wang wrote:
I'm +1 for this SPIP for these two reasons:
1. The current thriftserver has some issues that are not easy to solve, such as SPARK-28636.
2. The gap between the ORC version we are using and the one the built-in Hive uses is getting bigger and bigger, and we can't ensure that there will be no compatibility issues in the future. If the thriftserver does not depend on Hive, it will be much easier to upgrade the built-in Hive in the future.

On Sat, Dec 21, 2019 at 9:28 PM angers.zhu wrote:







Hi all,

I have completed a design doc about how to use and configure this new thrift server, and some design details about the changes and impersonation. I hope for your suggestions and ideas.

SPIP doc: https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.x97c6tj78zo0
Design doc: https://docs.google.com/document/d/1UKE9QTtHqSZBq0V_vEn54PlWaWPiRAKf_JvcT0skaSo/edit#heading=h.q1ed5q1ldh14
Thrift server configurations: https://docs.google.com/document/d/1uI35qJmQO4FKE6pr0h3zetZqww-uI8QsQjxaYY_qb1s/edit?usp=drive_web=110963191229426834922




Best Regards

  











 


- Forwarded Message -
From: angers.zhu
Date: 12/18/2019 22:29
To: dev-ow...@spark.apache.org
Subject: Re: [VOTE][SPARK-29018][SPIP]: Build spark thrift server based on protocol v11













Add spark-dev group access privilege to google.

  











 


On 12/18/2019 22:02, Sandeep Katta wrote:

I couldn't access the doc, please give permission to the spark-dev group.

On Wed, 18 Dec 2019 at 18:05, angers.zhu wrote:











With the development of Spark and Hive, in the current sql/hive-thriftserver module we need to do a lot of work to resolve code conflicts for different built-in Hive versions. It's annoying and unending work in the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server. We propose to implement a new thrift server and JDBC 

Fw:Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-21 Thread angers . zhu






Hi all,

I have completed a design doc about how to use and configure this new thrift server, and some design details about the changes and impersonation. I hope for your suggestions and ideas.

SPIP doc: https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.x97c6tj78zo0
Design doc: https://docs.google.com/document/d/1UKE9QTtHqSZBq0V_vEn54PlWaWPiRAKf_JvcT0skaSo/edit#heading=h.q1ed5q1ldh14
Thrift server configurations: https://docs.google.com/document/d/1uI35qJmQO4FKE6pr0h3zetZqww-uI8QsQjxaYY_qb1s/edit?usp=drive_web=110963191229426834922




Best Regards

  











 


- Forwarded Message -
From: angers.zhu
Date: 12/18/2019 22:29
To: dev-ow...@spark.apache.org
Subject: Re: [VOTE][SPARK-29018][SPIP]: Build spark thrift server based on protocol v11













Add spark-dev group access privilege to google.

  











 


On 12/18/2019 22:02, Sandeep Katta wrote:

I couldn't access the doc, please give permission to the spark-dev group.

On Wed, 18 Dec 2019 at 18:05, angers.zhu wrote:











With the development of Spark and Hive, in the current sql/hive-thriftserver module we need to do a lot of work to resolve code conflicts for different built-in Hive versions. It's annoying and unending work in the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server. We propose to implement a new thrift server and JDBC driver based on Hive's latest v11 TCLIService.thrift protocol. The new thrift server will have the features below:

   - Build a new module spark-service as Spark's thrift server
   - Don't need as much reflection and inherited code as the `hive-thriftserver` module
   - Support all functions the current `sql/hive-thriftserver` supports
   - Use code maintained by Spark itself, without depending on Hive
   - Support the original functions in Spark's own way, not limited by Hive's code
   - Support running with or without a Hive metastore
   - Support user impersonation via multi-tenant, split Hive authentication and DFS authentication
   - Support session hooks with Spark's own code
   - Add a new JDBC driver spark-jdbc, with Spark's own connection URL "jdbc:spark::/"
   - Support both hive-jdbc and spark-jdbc clients, so we can support most clients and BI platforms



JIRA: https://issues.apache.org/jira/browse/SPARK-29018
Google Doc: https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

I'll start with my +1

[VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-18 Thread angers . zhu










With the development of Spark and Hive, in the current sql/hive-thriftserver module we need to do a lot of work to resolve code conflicts for different built-in Hive versions. It's annoying and unending work in the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server. We propose to implement a new thrift server and JDBC driver based on Hive's latest v11 TCLIService.thrift protocol. The new thrift server will have the features below:

   - Build a new module spark-service as Spark's thrift server
   - Don't need as much reflection and inherited code as the `hive-thriftserver` module
   - Support all functions the current `sql/hive-thriftserver` supports
   - Use code maintained by Spark itself, without depending on Hive
   - Support the original functions in Spark's own way, not limited by Hive's code
   - Support running with or without a Hive metastore
   - Support user impersonation via multi-tenant, split Hive authentication and DFS authentication
   - Support session hooks with Spark's own code
   - Add a new JDBC driver spark-jdbc, with Spark's own connection URL "jdbc:spark::/"
   - Support both hive-jdbc and spark-jdbc clients, so we can support most clients and BI platforms



JIRA: https://issues.apache.org/jira/browse/SPARK-29018
Google Doc: https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

I'll start with my +1

 







Re: A question about broadcast nest loop join

2019-10-23 Thread angers . zhu












A `where ... not in (query block)` condition will be changed to a LeftAnti join by the optimizer rule RewritePredicateSubquery. Then, as cloud-fan said, it will be changed to a BroadcastNestedLoopJoin.

 


On 10/23/2019 20:55,Wenchen Fan wrote: 


I haven't looked into your query yet, just want to let you know that Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but currently Spark is only able to deal with it using broadcast.

On Wed, Oct 23, 2019 at 6:02 PM zhangliyun wrote:

Hi all:

I want to ask a question about broadcast nested loop join. From Google I know that left outer/semi join and right outer/semi join will use broadcast nested loop join, and in some cases, when the input data is very small, it is suitable to use. So how is "the input data is very small" defined? Which parameter decides the threshold? I just want to disable it (I found that setting spark.sql.autoBroadcastJoinThreshold=-1 does not work for the SQL: select a.key1 from testdata1 as a where a.key1 not in (select key3 from testdata3)).

```
explain cost select a.key1 from testdata1 as a where a.key1 not in (select key3 from testdata3);
== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || isnull((key1#90 = key3#92)))
   :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
   +- BroadcastExchange IdentityBroadcastMode
      +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]
```

My questions are:
1. Why, for a not-in subquery, is BroadcastNestedLoopJoin still used even when I set spark.sql.autoBroadcastJoinThreshold=-1?
2. Which Spark parameter decides whether to enable/disable BroadcastNestedLoopJoin?

I appreciate any suggestions.

Best Regards
Kelly Zhang





Re: Welcoming some new committers and PMC members

2019-09-09 Thread angers . zhu






Congratulations!

 

On 9/10/2019 08:32,Matei Zaharia wrote: 


Hi all,

The Spark PMC recently voted to add several new committers and one PMC member. Join me in welcoming them to their new roles!

New PMC member: Dongjoon Hyun
New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, Weichen Xu, Ruifeng Zheng

The new committers cover lots of important areas including ML, SQL, and data sources, so it's great to have them here.

All the best,
Matei and the Spark PMC




[Spark SQL] Anyone interested in doing SQL between two Hive metastores?

2019-09-07 Thread angers . zhu










Best Regards
Angers