Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Takuya UESHIN
+1

On Mon, Apr 15, 2024 at 11:17 AM Rui Wang  wrote:

> +1, non-binding.
>
> Thanks Dongjoon to drive this!
>
>
> -Rui
>
> On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng  wrote:
>
>> +1
>>
>> Thank you @Dongjoon Hyun  !
>>
>> On Mon, Apr 15, 2024 at 6:33 AM beliefer  wrote:
>>
>>> +1
>>>
>>>
>>> 在 2024-04-15 15:54:07,"Peter Toth"  写道:
>>>
>>> +1
>>>
>>> Wenchen Fan  ezt írta (időpont: 2024. ápr. 15., H,
>>> 9:08):
>>>
>>>> +1
>>>>
>>>> On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> I'll start from my +1.
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
>>>>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
>>>>> > The technical scope is defined in the following PR which is
>>>>> > one line of code change and one line of migration guide.
>>>>> >
>>>>> > - DISCUSSION:
>>>>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>>>>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>>>>> > - PR: https://github.com/apache/spark/pull/46013
>>>>> >
>>>>> > The vote is open until April 17th 1AM (PST) and passes
>>>>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>> >
>>>>> > [ ] +1 Use ANSI SQL mode by default
>>>>> > [ ] -1 Do not use ANSI SQL mode by default because ...
>>>>> >
>>>>> > Thank you in advance.
>>>>> >
>>>>> > Dongjoon
>>>>> >
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>

-- 
Takuya UESHIN


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Takuya UESHIN
+1

On Sun, Mar 31, 2024 at 6:16 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA <https://issues.apache.org/jira/browse/SPARK-47540>
> Prototype <https://github.com/apache/spark/pull/45053>
> SPIP doc
> <https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>


-- 
Takuya UESHIN


Re: [VOTE][SPIP] Python Data Source API

2023-07-10 Thread Takuya UESHIN
+1

On Sun, Jul 9, 2023 at 10:05 PM Ruifeng Zheng  wrote:

> +1
>
> On Mon, Jul 10, 2023 at 8:20 AM Jungtaek Lim 
> wrote:
>
>> +1
>>
>> On Sat, Jul 8, 2023 at 4:13 AM Reynold Xin 
>> wrote:
>>
>>> +1!
>>>
>>>
>>> On Fri, Jul 7 2023 at 11:58 AM, Holden Karau 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> +1 for me
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> Palantir Technologies Limited
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 7 Jul 2023 at 11:05, Martin Grund
>>>>>>  wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Fri, Jul 7, 2023 at 12:05 AM Denny Lee 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> On Fri, Jul 7, 2023 at 00:50 Maciej  wrote:
>>>>>>>>
>>>>>>>>> +0
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Maciej Szymkiewicz
>>>>>>>>>
>>>>>>>>> Web: https://zero323.net
>>>>>>>>> PGP: A30CEF0C31A501EC
>>>>>>>>>
>>>>>>>>> On 7/6/23 17:41, Xiao Li wrote:
>>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Xiao
>>>>>>>>>
>>>>>>>>> Hyukjin Kwon  于2023年7月5日周三 17:28写道:
>>>>>>>>>
>>>>>>>>>> +1.
>>>>>>>>>>
>>>>>>>>>> See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
>>>>>>>>>>
>>>>>>>>>> On Thu, 6 Jul 2023 at 09:15, Allison Wang
>>>>>>>>>> 
>>>>>>>>>>  wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to start the vote for SPIP: Python Data Source API.
>>>>>>>>>>>
>>>>>>>>>>> The high-level summary for the SPIP is that it aims to
>>>>>>>>>>> introduce a simple API in Python for Data Sources. The idea is to 
>>>>>>>>>>> enable
>>>>>>>>>>> Python developers to create data sources without learning Scala or 
>>>>>>>>>>> dealing
>>>>>>>>>>> with the complexities of the current data source APIs. This would 
>>>>>>>>>>> make
>>>>>>>>>>> Spark more accessible to the wider Python developer community.
>>>>>>>>>>>
>>>>>>>>>>> References:
>>>>>>>>>>>
>>>>>>>>>>>- SPIP doc
>>>>>>>>>>>
>>>>>>>>>>> <https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing>
>>>>>>>>>>>- JIRA ticket
>>>>>>>>>>><https://issues.apache.org/jira/browse/SPARK-44076>
>>>>>>>>>>>- Discussion thread
>>>>>>>>>>>
>>>>>>>>>>> <https://lists.apache.org/thread/w621zn14ho4rw61b0s139klnqh900s8y>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>>>>>
>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>> [ ] +0
>>>>>>>>>>> [ ] -1: I don’t think this is a good idea because __.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Allison
>>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>

-- 
Takuya UESHIN


Re: Welcoming three new PMC members

2022-08-09 Thread Takuya UESHIN
Congratulations!

On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon  wrote:

> Congrats everybody!
>
> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan 
> wrote:
>
>>
>> Congratulations !
>> Great to have you join the PMC !!
>>
>> Regards,
>> Mridul
>>
>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan 
>> wrote:
>>
>>> Congratulations
>>>
>>> On Tue, Aug 9, 2022, 11:40 AM Xiao Li  wrote:
>>>
>>>> Hi all,
>>>>
>>>> The Spark PMC recently voted to add three new PMC members. Join me in
>>>> welcoming them to their new roles!
>>>>
>>>> New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
>>>>
>>>> The Spark PMC
>>>>
>>>

-- 
Takuya UESHIN


Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Takuya UESHIN
Congratulations, Xinrong!

On Tue, Aug 9, 2022 at 10:07 AM Gengliang Wang  wrote:

> Congratulations, Xinrong! Well deserved.
>
>
> On Tue, Aug 9, 2022 at 7:09 AM Yi Wu  wrote:
>
>> Congrats Xinrong!!
>>
>>
>> On Tue, Aug 9, 2022 at 7:07 PM Maxim Gekk
>>  wrote:
>>
>>> Congratulations, Xinrong!
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>>
>>> On Tue, Aug 9, 2022 at 3:15 PM Weichen Xu
>>>  wrote:
>>>
>>>> Congrats!
>>>>
>>>> On Tue, Aug 9, 2022 at 5:55 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Congrats Xinrong! Well deserved.
>>>>>
>>>>> 2022년 8월 9일 (화) 오후 5:13, Hyukjin Kwon 님이 작성:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> The Spark PMC recently added Xinrong Meng as a committer on the
>>>>>> project. Xinrong is the major contributor of PySpark especially Pandas 
>>>>>> API
>>>>>> on Spark. She has guided a lot of new contributors enthusiastically. 
>>>>>> Please
>>>>>> join me in welcoming Xinrong!
>>>>>>
>>>>>>

-- 
Takuya UESHIN


Re: [VOTE] Release Spark 3.3.0 (RC3)

2022-05-27 Thread Takuya Ueshin
-1

I found a correctness issue in ArrayAggregate, and the fix was merged after
the RC3 cut.

- https://issues.apache.org/jira/browse/SPARK-39293
- https://github.com/apache/spark/pull/36674

Thanks.


On Tue, May 24, 2022 at 10:21 AM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 27th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc3 (commit
> a7259279d07b302a51456adb13dc1e41a6fd06ed):
> https://github.com/apache/spark/tree/v3.3.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1404
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc3.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


Re: PySpark Dynamic DataFrame for easier inheritance

2021-12-29 Thread Takuya Ueshin
I'm afraid I'm also against the proposal so far.

What's wrong with going with "1. Functions" and using transform, which allows
chaining functions?
I was not sure what you meant by "manage the namespaces", though.


def with_price(df, factor: float = 2.0):
    return df.withColumn("price", F.col("price") * factor)

df.transform(with_price).show()


I have to admit that the current transform is a bit annoying when the
function takes parameters:


df.transform(lambda input_df: with_price(input_df, 100)).show()


but we can improve transform to take the parameters for the function.
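
As an illustration only, here is a minimal sketch of what such an extended
transform could look like as a standalone helper. The name transform_with and
its signature are hypothetical, not an existing PySpark API; it just forwards
extra arguments to the function, reusing with_price from above.

from pyspark.sql import DataFrame


def transform_with(df: DataFrame, func, *args, **kwargs) -> DataFrame:
    # Hypothetical helper: apply func to df, forwarding any extra
    # positional/keyword arguments so callers can avoid the lambda.
    return func(df, *args, **kwargs)


transform_with(df, with_price, 100).show()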


Or, I'd also recommend using a wrapper as Maciej suggested, but without
delegating all functions.
I'd expose only functions which are really necessary; otherwise management
of the dataframe would be rather more difficult.

For example, with a MyBusinessDataFrame,

base_dataframe = spark.createDataFrame(
    data=[['product_1', 2], ['product_2', 4]],
    schema=["name", "price"],
)
dyn_business = MyBusinessDataFrame(base_dataframe)
dyn_business.select("name").my_business_query(2.0)


will raise an AnalysisException because the price column no longer exists
after the select("name").
We should manage the dataframe in the wrapper properly.
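
To make that concrete, a minimal sketch of the kind of selective wrapper I
have in mind (the class and method names are just the illustrative ones from
this thread, and the column check is only one possible way to "manage the
dataframe properly"):

class MyBusinessDataFrame:
    """Illustrative wrapper that exposes only the methods we really need."""

    def __init__(self, df):
        self._df = df

    def select(self, *cols):
        # Delegate explicitly and re-wrap, instead of delegating everything.
        return MyBusinessDataFrame(self._df.select(*cols))

    def my_business_query(self, factor: float = 2.0):
        # Fail fast with a clear message if the required column is missing.
        if "price" not in self._df.columns:
            raise ValueError("my_business_query requires a 'price' column")
        return MyBusinessDataFrame(
            self._df.withColumn("price", self._df["price"] * factor)
        )

    def show(self):
        self._df.show()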


Thanks.


On Wed, Dec 29, 2021 at 8:49 AM Maciej  wrote:

> On 12/29/21 16:18, Pablo Alcain wrote:
> > Hey Maciej! Thanks for your answer and the comments :)
> >
> > On Wed, Dec 29, 2021 at 3:06 PM Maciej  > > wrote:
> >
> > This seems like a lot of trouble for a not-so-common use case that has
> > viable alternatives. Once you assume that the class is intended for
> > inheritance (which, arguably, we neither do nor imply at the moment)
> > you're even more restricted than we are right now, according to the
> > project policy and the need for keeping things synchronized across all
> > languages.
> >
> > By "this" you mean the modification of the DataFrame, the implementation
> > of a new pyspark class (DynamicDataFrame in this case) or the approach
> > in general?
>
> I mean promoting DataFrame as extensible in general. There is a risk of
> getting stuck with a specific API, even more than we are right now, with
> little reward at the end.
>
> Additionally:
>
> - As far as I am aware, nothing suggests that it is a widely requested
> feature (corresponding SO questions didn't get much traffic over the
> years and I don't think we have any preceding JIRA tickets).
> - It can be addressed outside the project (within user codebase or as a
> standalone package) with minimal or no overhead.
>
> That being said ‒ if we're going to rewrite Python DataFrame methods to
> return instance type, I strongly believe that the existing methods
> should be marked as final.
>
> >
> >
> >
> > On the Scala side, I would rather expect to see type classes than direct
> > inheritance, so this might be a dead feature from the start.
> >
> > As for Python (sorry if I missed something in the preceding discussion),
> > quite a natural approach would be to wrap the DataFrame instance in your
> > business class and delegate calls to the wrapped object. A very naive
> > implementation could look like this:
> >
> > from functools import wraps
> >
> > class BusinessModel:
> >     @classmethod
> >     def delegate(cls, a):
> >         def _(*args, **kwargs):
> >             result = a(*args, **kwargs)
> >             if isinstance(result, DataFrame):
> >                 return cls(result)
> >             else:
> >                 return result
> >
> >         if callable(a):
> >             return wraps(a)(_)
> >         else:
> >             return a
> >
> >     def __init__(self, df):
> >         self._df = df
> >
> >     def __getattr__(self, name):
> >         return BusinessModel.delegate(getattr(self._df, name))
> >
> >     def with_price(self, price=42):
> >         return self.selectExpr("*", f"{price} as price")
> >
> >
> >
> > Yes, effectively the solution is very similar to this one. I believe
> > that the advantage of doing it without hijacking the delegation with the
> > decorator is that you can still maintain static typing.
>
> You can maintain type checker compatibility (it is easier with stubs,
> but you can do it with inline hints as well, if I recall correctly) here
> as well.
>
> > On the other
> > hand (and this is probably a minor issue), when following this approach
> > with the `isinstance` checking for the casting you might end up casting
> > the `.summary()` and `.describe()` methods that you probably still want
> > to keep as "pure" DataFrames. If you see it from this perspective, then
> > "DynamicDataFrame" would be the boilerplate code that allows you to
> > decide more granularly what methods you want to delegate.
>
> You can do it with `__getattr__` as well. There are 

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-29 Thread Takuya UESHIN
+1

On Mon, Mar 29, 2021 at 3:35 AM Ismaël Mejía  wrote:

> +1 (non-binding)
>
> On Mon, Mar 29, 2021 at 7:54 AM Wenchen Fan  wrote:
> >
> > +1
> >
> > On Mon, Mar 29, 2021 at 1:45 PM Holden Karau 
> wrote:
> >>
> >> +1
> >>
> >> On Sun, Mar 28, 2021 at 10:25 PM sarutak 
> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> - Kousuke
> >>>
> >>> > +1 (non-binding)
> >>> >
> >>> > On Sun, Mar 28, 2021 at 9:06 PM 郑瑞峰 
> >>> > wrote:
> >>> >
> >>> >> +1 (non-binding)
> >>> >>
> >>> >> -- 原始邮件 --
> >>> >>
> >>> >> 发件人: "Maxim Gekk" ;
> >>> >> 发送时间: 2021年3月29日(星期一) 凌晨2:08
> >>> >> 收件人: "Matei Zaharia";
> >>> >> 抄送: "Gengliang Wang";"Mridul
> >>> >> Muralidharan";"Xiao
> >>> >> Li";"Spark dev
> >>> >> list";"Takeshi
> >>> >> Yamamuro";
> >>> >> 主题: Re: [VOTE] SPIP: Support pandas API layer on PySpark
> >>> >>
> >>> >> +1 (non-binding)
> >>> >>
> >>> >> On Sun, Mar 28, 2021 at 8:53 PM Matei Zaharia
> >>> >>  wrote:
> >>> >>
> >>> >> +1
> >>> >>
> >>> >> Matei
> >>> >>
> >>> >> On Mar 28, 2021, at 1:45 AM, Gengliang Wang 
> >>> >> wrote:
> >>> >>
> >>> >> +1 (non-binding)
> >>> >>
> >>> >> On Sun, Mar 28, 2021 at 11:12 AM Mridul Muralidharan
> >>> >>  wrote:
> >>> >>
> >>> >> +1
> >>> >>
> >>> >> Regards,
> >>> >> Mridul
> >>> >>
> >>> >> On Sat, Mar 27, 2021 at 6:09 PM Xiao Li 
> >>> >> wrote:
> >>> >>
> >>> >> +1
> >>> >>
> >>> >> Xiao
> >>> >>
> >>> >> Takeshi Yamamuro  于2021年3月26日周五
> >>> >> 下午4:14写道:
> >>> >>
> >>> >> +1 (non-binding)
> >>> >>
> >>> >> On Sat, Mar 27, 2021 at 4:53 AM Liang-Chi Hsieh 
> >>> >> wrote:
> >>> >> +1 (non-binding)
> >>> >>
> >>> >> rxin wrote
> >>> >>> +1. Would open up a huge persona for Spark.
> >>> >>>
> >>> >>> On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler <
> >>> >>
> >>> >>> cutlerb@
> >>> >>
> >>> >>>> wrote:
> >>> >>>
> >>> >>>>
> >>> >>>> +1 (non-binding)
> >>> >>>>
> >>> >>>>
> >>> >>>> On Fri, Mar 26, 2021 at 9:49 AM Maciej <
> >>> >>
> >>> >>> mszymkiewicz@
> >>> >>
> >>> >>>> wrote:
> >>> >>>>
> >>> >>>>
> >>> >>>>> +1 (nonbinding)
> >>> >>
> >>> >> --
> >>> >> Sent from:
> >>> >> http://apache-spark-developers-list.1001551.n3.nabble.com/
> >>> >>
> >>> >>
> >>> > -
> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>
> >>> >> --
> >>> >>
> >>> >> ---
> >>> >> Takeshi Yamamuro
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >> --
> >> Twitter: https://twitter.com/holdenkarau
> >> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Takuya UESHIN


Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Takuya UESHIN
Congrats and welcome!

On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler  wrote:

> Congratulations and welcome!
>
> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
> wrote:
>
>> Welcome, Huaxin, Jungtaek, and Dilip!
>>
>> Congratulations!
>>
>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add several new committers. Please join
>>> me in welcoming them to their new roles! The new committers are:
>>>
>>> - Huaxin Gao
>>> - Jungtaek Lim
>>> - Dilip Biswal
>>>
>>> All three of them contributed to Spark 3.0 and we’re excited to have
>>> them join the project.
>>>
>>> Matei and the Spark PMC
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Takuya UESHIN


[OSS DIGEST] The major changes of Apache Spark from May 6 to May 19

2020-06-09 Thread Takuya Ueshin
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, there will be an *[API]* tag in
the title.

CORE
[3.0][SPARK-31559][YARN]
Re-obtain tokens at the startup of AM for yarn cluster mode if principal
and keytab are available (+14, -1)>


Re-obtain tokens at the start of AM for yarn cluster mode, if principal and
keytab are available. It basically transfers the credentials from the
original user, so this patch puts the new tokens into credentials from the
original user via overwriting.

Submitter will obtain delegation tokens for yarn-cluster mode, and add
these credentials to the launch context. AM will be launched with these
credentials, and AM and driver are able to leverage these tokens.

In YARN cluster mode, the driver is launched in the AM, which in turn
initializes the token manager (while initializing SparkContext) and obtains
delegation tokens (and schedules renewal) if both principal and keytab are
available.
[2.4][SPARK-31399][CORE]
Support indylambda Scala closure in ClosureCleaner (+434, -47)>


There had been previous efforts to extend Spark's ClosureCleaner to support
"indylambda" Scala closures, which is necessary for proper Scala 2.12
support. Most notably, the work was done in SPARK-14540.

But the previous efforts had missed one important scenario: a Scala closure
declared in a Scala REPL that captures the enclosing this -- a REPL line
object.

This PR proposes to enhance Spark's ClosureCleaner to support "indylambda"
style of Scala closures to the same level as the existing implementation
for the old (inner class) style ones. The goal is to reach feature parity
with the support of the old style Scala closures, with as close to
bug-for-bug compatibility as possible.
[3.0][SPARK-31743][CORE]
Add spark_info metric into PrometheusResource (+2, -0)>


Add spark_info metric into PrometheusResource.

$ bin/spark-shell --driver-memory 4G -c spark.ui.prometheus.enabled=true

$ curl -s http://localhost:4041/metrics/executors/prometheus/ | head -n1
spark_info{version="3.1.0",
revision="097d5098cca987e5f7bbb8394783c01517ebed0f"} 1.0

[API][3.1][SPARK-20732][CORE]
Decommission cache blocks to other executors when an executor is
decommissioned (+409, -13)>


After the changes in SPARK-20628, CoarseGrainedSchedulerBackend can
decommission an executor and stop assigning new tasks on it. We should also
decommission the corresponding block managers in the same way, i.e., move the
cached RDD blocks from those executors to other active executors. It
introduces 3 new configurations:
- spark.storage.decommission.enabled (default: false)
  Whether to decommission the block manager when decommissioning executor.
- spark.storage.decommission.maxReplicationFailuresPerBlock (default: 3)
  Maximum number of failures which can be handled for the replication of one
  RDD block when block manager is decommissioning and trying to move its
  existing blocks.
- spark.storage.decommission.replicationReattemptInterval (default: 30s)
  The interval of time between consecutive cache block replication reattempts
  happening on each decommissioning executor (due to storage decommissioning).
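
As an illustration only (assuming a Spark build that includes this patch),
the configurations listed above could be set when creating a session:

from pyspark.sql import SparkSession

# Illustrative values only; the config names come from the list above.
spark = (
    SparkSession.builder
    .appName("storage-decommission-example")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.maxReplicationFailuresPerBlock", "3")
    .config("spark.storage.decommission.replicationReattemptInterval", "30s")
    .getOrCreate()
)
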
SQL
[API][3.0][SPARK-31365][SQL]
Enable nested predicate pushdown per data sources (+186, -100)>


Replaces the config spark.sql.optimizer.nestedPredicatePushdown.enabled with
spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources, which can
configure which v1 data sources are enabled for nested predicate pushdown;
the previous config was an all-or-nothing switch that applied to all data
sources.

In order to not introduce an unexpected API breaking change after enabling
nested predicate pushdown, we'd like to set nested predicate pushdown per
data 

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Takuya UESHIN
aking an API
>>>>>>>> >> >>
>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>> users of Spark. A broken API means that Spark programs need to be 
>>>>>>>> rewritten
>>>>>>>> before they can be upgraded. However, there are a few considerations 
>>>>>>>> when
>>>>>>>> thinking about what the cost will be:
>>>>>>>> >> >>
>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>> places, is always very costly to break. While it is hard to know usage 
>>>>>>>> for
>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>> >> >>
>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>> >> >>
>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>> >> >>
>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>>>>>> >> >>
>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>> >> >>
>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>> today, work after the break? The following are listed roughly in order 
>>>>>>>> of
>>>>>>>> increasing severity:
>>>>>>>> >> >>
>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>> >> >>
>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>> >> >>
>>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>>> been done?
>>>>>>>> >> >>
>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>> debug, might not even notice!)
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>> >> >>
>>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>>> any APIs. We must also consider the cost both to the project and to our
>>>>>>>> users of keeping the API in question.
>>>>>>>> >> >>
>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>> needs to keep working as other parts of the project changes. These 
>>>>>>>> costs
>>>>>>>> are significantly exacerbated when external dependencies change (the 
>>>>>>>> JVM,
>>>>>>>> Scala, etc). In some cases, while not completely technically 
>>>>>>>> infeasible,
>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>> >> >>
>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>> learning Spark or trying to understand Spark programs. This cost 
>>>>>>>> becomes
>>>>>>>> even higher when the API in question has confusing or undefined 
>>>>>>>> semantics.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>> >> >>
>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>> removal is also high, there are alternatives that should be considered 
>>>>>>>> that
>>>>>>>> do not hurt existing users but do address some of the maintenance 
>>>>>>>> costs.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>> important point. Anytime we are adding a new interface to Spark we 
>>>>>>>> should
>>>>>>>> consider that we might be stuck with this API forever. Think deeply 
>>>>>>>> about
>>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>>> evolve over time.
>>>>>>>> >> >>
>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point
>>>>>>>> to a clear alternative and should never just say that an API is 
>>>>>>>> deprecated.
>>>>>>>> >> >>
>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>> recommended way of performing a given task. In the cases where we 
>>>>>>>> maintain
>>>>>>>> legacy documentation, we should clearly point to newer APIs and 
>>>>>>>> suggest to
>>>>>>>> users the "right" way.
>>>>>>>> >> >>
>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>>>>>>> other sites such as StackOverflow. However, many of these resources 
>>>>>>>> are out
>>>>>>>> of date. Update them, to reduce the cost of eventually removing 
>>>>>>>> deprecated
>>>>>>>> APIs.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> 
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> -
>>>>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>> >>
>>>>>>>>
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>
>> --
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
Takuya UESHIN

http://twitter.com/ueshin


Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Takuya UESHIN
+1

On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp  wrote:

> +1
>
> On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon  wrote:
> >
> > +1
> >
> > 2019년 11월 6일 (수) 오후 11:38, Wenchen Fan 님이 작성:
> >>
> >> Sounds reasonable to me. We should make the behavior consistent within
> Spark.
> >>
> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler  wrote:
> >>>
> >>> Currently, when a PySpark Row is created with keyword arguments, the
> fields are sorted alphabetically. This has created a lot of confusion with
> users because it is not obvious (although it is stated in the pydocs) that
> they will be sorted alphabetically. Then later when applying a schema and
> the field order does not match, an error will occur. Here is a list of some
> of the JIRAs that I have been tracking all related to this issue:
> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
> of the issue [1].
> >>>
> >>> The original reason for sorting fields is because kwargs in python <
> 3.6 are not guaranteed to be in the same order that they were entered [2].
> Sorting alphabetically ensures a consistent order. Matters are further
> complicated with the flag _from_dict_ that allows the Row fields to be
> referenced by name when made by kwargs, but this flag is not serialized
> with the Row and leads to inconsistent behavior. For instance:
> >>>
> >>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A
> string").first()
> >>> Row(B='2', A='1')
> >>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1",
> B="2")]), "B string, A string").first()
> >>> Row(B='1', A='2')
> >>>
> >>> I think the best way to fix this is to remove the sorting of fields
> when constructing a Row. For users with Python 3.6+, nothing would change
> because these versions of Python ensure that the kwargs stay in the
> order entered. For users with Python < 3.6, using kwargs would check a
> conf to either raise an error or fallback to a LegacyRow that sorts the
> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
> can also be removed at the same time. There are also other ways to create
> Rows that will not be affected. I have opened a JIRA [3] to capture this,
> but I am wondering what others think about fixing this for Spark 3.0?
> >>>
> >>> [1] https://github.com/apache/spark/pull/20280
> >>> [2] https://www.python.org/dev/peps/pep-0468/
> >>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-31 Thread Takuya UESHIN
+1

On Thu, Oct 31, 2019 at 11:21 AM Bryan Cutler  wrote:

> +1 for deprecating
>
> On Wed, Oct 30, 2019 at 2:46 PM Shane Knapp  wrote:
>
>> sure.  that shouldn't be too hard, but we've historically given very
>> little support to it.
>>
>> On Wed, Oct 30, 2019 at 2:31 PM Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> Could we upgrade to PyPy3.6 v7.2.0?
>>> On 10/30/19 9:45 PM, Shane Knapp wrote:
>>>
>>> one quick thing:  we currently test against python2.7, 3.6 *and*
>>> pypy2.5.1 (python2.7).
>>>
>>> what are our plans for pypy?
>>>
>>>
>>> On Wed, Oct 30, 2019 at 12:26 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you all. I made a PR for that.
>>>>
>>>> https://github.com/apache/spark/pull/26326
>>>>
>>>> On Tue, Oct 29, 2019 at 5:45 AM Takeshi Yamamuro 
>>>> wrote:
>>>>
>>>>> +1, too.
>>>>>
>>>>> On Tue, Oct 29, 2019 at 4:16 PM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> +1 to deprecating but not yet removing support for 3.6
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 3:47 AM Shane Knapp 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to testing the absolute minimum number of python variants as
>>>>>>> possible.  ;)
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 7:46 PM Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 from me as well.
>>>>>>>>
>>>>>>>> 2019년 10월 29일 (화) 오전 5:34, Xiangrui Meng 님이
>>>>>>>> 작성:
>>>>>>>>
>>>>>>>>> +1. And we should start testing 3.7 and maybe 3.8 in Jenkins.
>>>>>>>>>
>>>>>>>>> On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun <
>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you for starting the thread.
>>>>>>>>>>
>>>>>>>>>> In addition to that, we currently are testing Python 3.6 only in
>>>>>>>>>> Apache Spark Jenkins environment.
>>>>>>>>>>
>>>>>>>>>> Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1
>>>>>>>>>> will start next January
>>>>>>>>>> (https://spark.apache.org/versioning-policy.html), I'm +1 for
>>>>>>>>>> the deprecation (Python < 3.6) at Apache Spark 3.0.0.
>>>>>>>>>>
>>>>>>>>>> It's just a deprecation to prepare the next-step development
>>>>>>>>>> cycle.
>>>>>>>>>> Bests,
>>>>>>>>>> Dongjoon.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz <
>>>>>>>>>> mszymkiew...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> While deprecation of Python 2 in 3.0.0 has been announced
>>>>>>>>>>> <https://spark.apache.org/news/plan-for-dropping-python-2-support.html>,
>>>>>>>>>>> there is no clear statement about specific continuing support of 
>>>>>>>>>>> different
>>>>>>>>>>> Python 3 version.
>>>>>>>>>>>
>>>>>>>>>>> Specifically:
>>>>>>>>>>>
>>>>>>>>>>>- Python 3.4 has been retired this year.
>>>>>>>>>>>- Python 3.5 is already in the "security fixes only" mode
>>>>>>>>>>>and should be retired in the middle of 2020.
>>>>>>>>>>>
>>>>>>>>>>> Continued support of these two blocks adoption of many new
>>>>>>>>>>> Python features (PEP 468)  and it is hard to justify beyond 2020.
>>>>>>>>>>>
>>>>>>>>>>> Should these two be deprecated in 3.0.0 as well?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Maciej
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Shane Knapp
>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> https://rise.cs.berkeley.edu
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcoming some new committers and PMC members

2019-09-09 Thread Takuya UESHIN
Congratulations!

On Mon, Sep 9, 2019 at 5:40 PM Xiao Li  wrote:

> Congratulations to all of you!
>
> Xiao
>
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers and one PMC
>> member. Join me in welcoming them to their new roles!
>>
>> New PMC member: Dongjoon Hyun
>>
>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
>> Weichen Xu, Ruifeng Zheng
>>
>> The new committers cover lots of important areas including ML, SQL, and
>> data sources, so it’s great to have them here. All the best,
>>
>> Matei and the Spark PMC
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread Takuya UESHIN
Congrats, Jose!

On Wed, Jan 30, 2019 at 11:10 AM Yuanjian Li  wrote:

> Congrats Jose!
>
> Best,
> Yuanjian
>
> Takeshi Yamamuro  于2019年1月30日周三 上午8:21写道:
>
>> Congrats, Jose!
>>
>> Best,
>> Takeshi
>>
>> On Wed, Jan 30, 2019 at 6:10 AM Jungtaek Lim  wrote:
>>
>>> Congrats Jose! Well deserved.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> 2019년 1월 30일 (수) 오전 5:19, Dongjoon Hyun 님이 작성:
>>>
>>>> Congrats, Jose! :)
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Tue, Jan 29, 2019 at 11:41 AM Arun Mahadevan 
>>>> wrote:
>>>>
>>>>> Congrats Jose! Well deserved.
>>>>>
>>>>> On Tue, 29 Jan 2019 at 11:15, Jules Damji  wrote:
>>>>>
>>>>>> Congrats Jose!
>>>>>>
>>>>>> Sent from my iPhone
>>>>>> Pardon the dumb thumb typos :)
>>>>>>
>>>>>> On Jan 29, 2019, at 11:07 AM, shane knapp 
>>>>>> wrote:
>>>>>>
>>>>>> congrats, and welcome!
>>>>>>
>>>>>> On Tue, Jan 29, 2019 at 11:07 AM Dean Wampler 
>>>>>> wrote:
>>>>>>
>>>>>>> Congrats, Jose!
>>>>>>>
>>>>>>>
>>>>>>> *Dean Wampler, Ph.D.*
>>>>>>>
>>>>>>> *VP, Fast Data Engineering at Lightbend*
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 29, 2019 at 12:52 PM Burak Yavuz 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Congrats Jose!
>>>>>>>>
>>>>>>>> On Tue, Jan 29, 2019 at 10:50 AM Xiao Li 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Congratulations!
>>>>>>>>>
>>>>>>>>> Xiao
>>>>>>>>>
>>>>>>>>> Shixiong Zhu  于2019年1月29日周二 上午10:48写道:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> The Apache Spark PMC recently added Jose Torres as a committer on
>>>>>>>>>> the project. Jose has been a major contributor to Structured 
>>>>>>>>>> Streaming.
>>>>>>>>>> Please join me in welcoming him!
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>>
>>>>>>>>>> Shixiong Zhu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Shane Knapp
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>>>
>>>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: welcome a new batch of committers

2018-10-03 Thread Takuya UESHIN
Congratulations!


On Wed, Oct 3, 2018 at 6:35 PM Marco Gaido  wrote:

> Congrats you all!
>
> Il giorno mer 3 ott 2018 alle ore 11:29 Liang-Chi Hsieh 
> ha scritto:
>
>>
>> Congratulations to all new committers!
>>
>>
>> rxin wrote
>> > Hi all,
>> >
>> > The Apache Spark PMC has recently voted to add several new committers to
>> > the project, for their contributions:
>> >
>> > - Shane Knapp (contributor to infra)
>> > - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
>> > - Kazuaki Ishizaki (contributor to Spark SQL)
>> > - Xingbo Jiang (contributor to Spark Core and SQL)
>> > - Yinan Li (contributor to Spark on Kubernetes)
>> > - Takeshi Yamamuro (contributor to Spark SQL)
>> >
>> > Please join me in welcoming them!
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread Takuya UESHIN
Hi Chongguang,

Thanks for the report!

That makes sense and the proposition should work, or we can add something
like `def array_contains(column: Column, value: Column)`.
Maybe other functions, such as `array_position`, `element_at`, are the same
situation.

Could you file a JIRA, and submit a PR if possible?
We can have a discussion more about the issue there.

Btw, I guess you can use `expr("array_contains(columnA, columnB)")` as a
workaround.

Thanks.


On Thu, Jun 14, 2018 at 2:15 AM, 刘崇光  wrote:

>
> -- Forwarded message --
> From: 刘崇光 
> Date: Thu, Jun 14, 2018 at 11:08 AM
> Subject: array_contains in package org.apache.spark.sql.functions
> To: u...@spark.apache.org
>
>
> Hello all,
>
> I ran into a use case in a project with Spark SQL and want to share with you
> some thoughts about the function array_contains.
>
> Say I have a Dataframe containing 2 columns. Column A of type "Array of
> String" and Column B of type "String". I want to determine if the value of
> column B is contained in the value of column A, without using a udf of
> course.
> The function array_contains came into my mind naturally:
>
> def array_contains(column: Column, value: Any): Column = withExpr {
>   ArrayContains(column.expr, Literal(value))
> }
>
> However the function takes the column B and does a "Literal" of column B,
> which yields a runtime exception: RuntimeException("Unsupported literal
> type " + v.getClass + " " + v).
>
> Then after discussion with my friends, we found a solution without using a
> udf:
>
> new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr))
>
>
> With this solution, I think of empowering the function a little bit more,
> by doing something like this:
>
> def array_contains(column: Column, value: Any): Column = withExpr {
>   value match {
>     case c: Column => ArrayContains(column.expr, c.expr)
>     case _ => ArrayContains(column.expr, Literal(value))
>   }
> }
>
>
> It does pattern matching to detect if value is of type Column. If yes,
> it will use the .expr of the column, otherwise it will work as it used to.
>
> Any suggestion or opinion on the proposition?
>
>
> Kind regards,
> Chongguang LIU
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Takuya UESHIN
Congratulations!

On Mon, Apr 2, 2018 at 10:34 AM, Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Mon, Apr 2, 2018 at 07:57 Cody Koeninger <c...@koeninger.org> wrote:
>
>> Congrats!
>>
>> On Mon, Apr 2, 2018 at 12:28 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>> > Hi all,
>> >
>> > The Spark PMC recently added Zhenhua Wang as a committer on the project.
>> > Zhenhua is the major contributor of the CBO project, and has been
>> > contributing across several areas of Spark for a while, focusing
>> especially
>> > on analyzer, optimizer in Spark SQL. Please join me in welcoming
>> Zhenhua!
>> >
>> > Wenchen
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcoming some new committers

2018-03-02 Thread Takuya UESHIN
Congratulations and welcome!

On Sat, Mar 3, 2018 at 10:21 AM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Congratulations to everyone!
>
> 2018-03-03 8:51 GMT+08:00 Ilan Filonenko <i...@cornell.edu>:
>
>> Congrats to everyone! :)
>>
>> On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> Congrats and welcome!
>>>
>>> --
>>> *From:* Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> *Sent:* Friday, March 2, 2018 4:27:10 PM
>>> *To:* Spark dev list
>>> *Subject:* Re: Welcoming some new committers
>>>
>>> Congrats to all!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Congratulations to everyone and welcome!
>>>>
>>>> On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> Congrats to the new committers, and I appreciate the vote of
>>>>> confidence.
>>>>>
>>>>> On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia <matei.zaha...@gmail.com>
>>>>> wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > The Spark PMC has recently voted to add several new committers to
>>>>> the project, based on their contributions to Spark 2.3 and other past 
>>>>> work:
>>>>> >
>>>>> > - Anirudh Ramanathan (contributor to Kubernetes support)
>>>>> > - Bryan Cutler (contributor to PySpark and Arrow support)
>>>>> > - Cody Koeninger (contributor to streaming and Kafka support)
>>>>> > - Erik Erlandson (contributor to Kubernetes support)
>>>>> > - Matt Cheah (contributor to Kubernetes support and other parts of
>>>>> Spark)
>>>>> > - Seth Hendrickson (contributor to MLlib and PySpark)
>>>>> >
>>>>> > Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and
>>>>> Seth as committers!
>>>>> >
>>>>> > Matei
>>>>> > 
>>>>> -
>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-22 Thread Takuya UESHIN
+1

On Fri, Feb 23, 2018 at 12:24 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> +1
>
> On Fri, Feb 23, 2018 at 6:23 AM, Sameer Agarwal <samee...@apache.org>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC
>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.3.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.3.0-rc5: https://github.com/apache/spar
>> k/tree/v2.3.0-rc5 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>
>> List of JIRA tickets resolved in this release can be found here:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1266/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs
>> /_site/index.html
>>
>>
>> FAQ
>>
>> ===
>> What are the unresolved issues targeted for 2.3.0?
>> ===
>>
>> Please see https://s.apache.org/oXKi. At the time of writing, there are
>> currently no known release blockers.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks, in the Java/Scala you can
>> add the staging repository to your projects resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.0?
>> ===
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
>> appropriate.
>>
>> ===
>> Why is my bug not fixed?
>> ===
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.2.0. That being said, if
>> there is something which is a regression from 2.2.0 and has not been
>> correctly targeted please ping me or a committer to help target the issue
>> (you can see the open issues listed as impacting Spark 2.3.0 at
>> https://s.apache.org/WmoI).
>>
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Takuya UESHIN
+1


On Tue, Feb 20, 2018 at 2:14 PM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> +1
>
>
> Wenchen Fan <cloud0...@gmail.com>于2018年2月20日 周二下午1:09写道:
>
>> +1
>>
>> On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>>> +1
>>>
>>> On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal <sameer.a...@gmail.com>,
>>> wrote:
>>>
>>> this file shouldn't be included? https://dist.apache.org/repos/
>>>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>>>
>>>
>>> I've now deleted this file
>>>
>>> *From:* Sameer Agarwal <sameer.a...@gmail.com>
>>>> *Sent:* Saturday, February 17, 2018 1:43:39 PM
>>>> *To:* Sameer Agarwal
>>>> *Cc:* dev
>>>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>>>
>>>> I'll start with a +1 once again.
>>>>
>>>> All blockers reported against RC3 have been resolved and the builds are
>>>> healthy.
>>>>
>>>> On 17 February 2018 at 13:41, Sameer Agarwal <samee...@apache.org>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.3.0. The vote is open until Thursday February 22, 2018 at 
>>>>> 8:00:00
>>>>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>>>>
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>>>
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.3.0-rc4: https://github.com/apache/
>>>>> spark/tree/v2.3.0-rc4 (44095cb65500739695b0324c177c19dfa1471472)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found here:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/
>>>>> orgapachespark-1265/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-
>>>>> docs/_site/index.html
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>> ===
>>>>> What are the unresolved issues targeted for 2.3.0?
>>>>> ===
>>>>>
>>>>> Please see https://s.apache.org/oXKi. At the time of writing, there
>>>>> are currently no known release blockers.
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the Java/Scala you
>>>>> can add the staging repository to your projects resolvers and test with 
>>>>> the
>>>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>>>> up building with an out-of-date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 2.3.0?
>>>>> ===
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 
>>>>> as
>>>>> appropriate.
>>>>>
>>>>> ===
>>>>> Why is my bug not fixed?
>>>>> ===
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.2.0. That being
>>>>> said, if there is something which is a regression from 2.2.0 and has not
>>>>> been correctly targeted please ping me or a committer to help target the
>>>>> issue (you can see the open issues listed as impacting Spark 2.3.0 at
>>>>> https://s.apache.org/WmoI).
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sameer Agarwal
>>>> Computer Science | UC Berkeley
>>>> http://cs.berkeley.edu/~sameerag
>>>>
>>>
>>>
>>>
>>> --
>>> Sameer Agarwal
>>> Computer Science | UC Berkeley
>>> http://cs.berkeley.edu/~sameerag
>>>
>>>
>>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?

2017-11-15 Thread Takuya UESHIN
Thanks for feedback.

Hyukjin Kwon:
> My only worry is users who depend on lower pandas versions

That's what I was worried about, and it's one of the reasons I moved this discussion here.

Li Jin:
> how complicated it is to support pandas < 0.19.2 with old non-Arrow
interops

In my original PR (https://github.com/apache/spark/pull/19607) we will fix
the behavior of timestamp values for Pandas.
If we need to support old Pandas, we will need at least some workarounds
like in the following link:
https://github.com/apache/spark/blob/e919ed55758f75733d56287d5a49326b1067a44c/python/pyspark/sql/types.py#L1718-L1774


Thanks.


On Wed, Nov 15, 2017 at 12:59 AM, Li Jin <ice.xell...@gmail.com> wrote:

> I think this makes sense. PySpark/Pandas interops in 2.3 are new anyway, I
> don't think we need to support the new functionality with older version of
> pandas (Takuya's reason 3)
>
> One thing I am not sure is how complicated it is to support pandas <
> 0.19.2 with old non-Arrow interops and require pandas >= 0.19.2 for new
> Arrow interops. Maybe it makes sense to allow user keep using their PySpark
> code if they don't want to use any of the new stuff. If this is still
> complicated, I would be leaning towards not supporting < 0.19.2.
>
>
> On Tue, Nov 14, 2017 at 6:04 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> +0 to drop it, as I said in the PR. I can see that it makes it a lot harder
>> to get the cool changes through and slows them down from getting pushed.
>>
>> My only worry is, users who depends on lower pandas versions (Pandas
>> 0.19.2 seems released less then a year before. In the similar time, Spark
>> 2.1.0 was released).
>>
>> If this worry is less than I expected, I definitely support it. It should
>> speed up those cool changes.
>>
>>
>> On 14 Nov 2017 7:14 pm, "Takuya UESHIN" <ues...@happy-camper.st> wrote:
>>
>> Hi all,
>>
>> I'd like to raise a discussion about Pandas version.
>> Originally we are discussing it at https://github.com/apache/s
>> park/pull/19607 but we'd like to ask for feedback from community.
>>
>>
>> Currently we don't explicitly specify the Pandas version we are
>> supporting but we need to decide what version we should support because:
>>
>>   - There have been a number of API evolutions around extension dtypes
>> that make supporting pandas 0.18.x and lower challenging.
>>
>>   - Sometimes Pandas older than 0.19.2 doesn't handle timestamp values
>> properly. We want to provide proper support for timestamp values.
>>
>>   - If users want to use vectorized UDFs, or toPandas / createDataFrame
>> from a Pandas DataFrame with Arrow, which will be released in Spark 2.3, users
>> have to upgrade to Pandas 0.19.2 or higher anyway because we need pyarrow
>> internally, which supports only 0.19.2 or higher.
>>
>>
>> The point I'd like to ask is:
>>
>> Can we drop support old Pandas (<0.19.2)?
>> If not, what version should we support?
>>
>>
>> References:
>>
>> - vectorized UDF
>>   - https://github.com/apache/spark/pull/18659
>>   - https://github.com/apache/spark/pull/18732
>> - toPandas with Arrow
>>   - https://github.com/apache/spark/pull/18459
>> - createDataFrame from pandas DataFrame with Arrow
>>   - https://github.com/apache/spark/pull/19646
>>
>>
>> Any comments are welcome!
>>
>> Thanks.
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


[discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?

2017-11-14 Thread Takuya UESHIN
Hi all,

I'd like to raise a discussion about Pandas version.
Originally we are discussing it at
https://github.com/apache/spark/pull/19607 but we'd like to ask for
feedback from community.


Currently we don't explicitly specify the Pandas version we are supporting
but we need to decide what version we should support because:

  - There have been a number of API evolutions around extension dtypes that
make supporting pandas 0.18.x and lower challenging.

  - Sometimes Pandas older than 0.19.2 doesn't handle timestamp values
properly. We want to provide proper support for timestamp values.

  - If users want to use vectorized UDFs, or toPandas / createDataFrame
from a Pandas DataFrame with Arrow, which will be released in Spark 2.3, users
have to upgrade to Pandas 0.19.2 or higher anyway because we need pyarrow
internally, which supports only 0.19.2 or higher.


The point I'd like to ask is:

Can we drop support old Pandas (<0.19.2)?
If not, what version should we support?
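
Just to make the implication concrete, here is a rough sketch of the kind of
check this would imply at import time (illustrative code only, not the actual
PySpark implementation):

from distutils.version import LooseVersion

import pandas as pd

# 0.19.2 is the candidate minimum version discussed in this thread.
minimum_pandas_version = "0.19.2"

if LooseVersion(pd.__version__) < LooseVersion(minimum_pandas_version):
    raise ImportError(
        "Pandas >= %s must be installed; however, your version was %s."
        % (minimum_pandas_version, pd.__version__)
    )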


References:

- vectorized UDF
  - https://github.com/apache/spark/pull/18659
  - https://github.com/apache/spark/pull/18732
- toPandas with Arrow
  - https://github.com/apache/spark/pull/18459
- createDataFrame from pandas DataFrame with Arrow
  - https://github.com/apache/spark/pull/19646


Any comments are welcome!

Thanks.

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcoming Tejas Patil as a Spark committer

2017-10-03 Thread Takuya UESHIN
Congratulations!


On Tue, Oct 3, 2017 at 2:47 AM, Tejas Patil <tejas.patil...@gmail.com>
wrote:

> Thanks everyone !!! It's a great privilege to be part of the Spark
> community.
>
> ~tejasp
>
> On Sat, Sep 30, 2017 at 2:27 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> Oh, yeah. Seen Tejas here and there in the commits. Well deserved.
>>
>> Jacek
>>
>> On 29 Sep 2017 9:58 pm, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>>
>> Hi all,
>>
>> The Spark PMC recently added Tejas Patil as a committer on the
>> project. Tejas has been contributing across several areas of Spark for
>> a while, focusing especially on scalability issues and SQL. Please
>> join me in welcoming Tejas!
>>
>> Matei
>>
>> -----
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-12 Thread Takuya UESHIN
This vote passes with 4 binding +1 votes, 6 non-binding +1 votes, no +0
votes, and no -1 votes.

Thanks all!

+1 votes (binding):
Reynold Xin
Wenchen Fan
Yin Huai
Matei Zaharia


+1 votes (non-binding):
Felix Cheung
Bryan Cutler
Sameer Agarwal
Hyukjin Kwon
Xiao Li
Liang-Chi Hsieh



On Tue, Sep 12, 2017 at 11:46 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> +1
>
>
> Xiao Li wrote:
> > +1
> >
> > Xiao
> > On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >
> >> +1 (binding)
> >>
> >> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >> >
> >> > +1 (non-binding)
> >> >
> >> >
> >> > 2017-09-12 9:52 GMT+09:00 Yin Huai:
> >> > +1
> >> >
> >> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal wrote:
> >> > +1 (non-binding)
> >> >
> >> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutl...@gmail.com> wrote:
> >> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> >> > fine to work out the minor details of the API during review.
> >> >
> >> > Bryan
> >> >

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-06 Thread Takuya UESHIN
Hi all,

Thank you for voting and suggestions.

As Wenchen mentioned, and as we're also discussing on JIRA, we need to discuss
the size hint for the 0-parameter UDF.
But I believe we have reached a consensus on the basic APIs except for the size
hint, so I'd like to submit a PR based on the current proposal and continue
the discussion in its review.

https://github.com/apache/spark/pull/19147

I'll keep this vote open to wait for more opinions.

Thanks.


On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> +1 on the design and proposed API.
>
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify
> the size hint. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> +1 on this and like the suggestion of type in string form.
>>
>> Would it be correct to assume there will be a data type check, for example
>> that the returned pandas data frame column data types match what is
>> specified? We have seen quite a few issues/confusions with that in R.
>>
>> Would it make sense to have a more generic decorator name so that it
>> could also be usable for other efficient vectorized formats in the future?
>> Or do we anticipate the decorator to be format-specific, with more added
>> in the future?
>>
>> --
>> *From:* Reynold Xin <r...@databricks.com>
>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>> *To:* Takuya UESHIN
>> *Cc:* spark-dev
>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>
>> Ok, thanks.
>>
>> +1 on the SPIP for scope etc
>>
>>
>> On API details (will deal with in code reviews as well but leaving a note
>> here in case I forget)
>>
>> 1. I would suggest having the API also accept data type specification in
>> string form. It is usually simpler to say "long" than "LongType()".
>>
>> 2. Think about what error message to show when the row counts don't
>> match at runtime.
>>
>>
>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
>> wrote:
>>
>>> Yes, the aggregation is out of scope for now.
>>> I think we should continue discussing the aggregation at JIRA and we
>>> will be adding those later separately.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Is the idea that aggregation is out of scope for the current effort and
>>>> we will be adding it later?
>>>>
>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been discussing supporting vectorized UDFs in Python and we have
>>>>> almost reached a consensus on the APIs, so I'd like to summarize and
>>>>> call for a vote.
>>>>>
>>>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>>>> for vectorized UDAFs or Window operations.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>
>>>>>
>>>>> *Proposed API*
>>>>>
>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>> vectorized UDFs which take one or more pandas.Series, or one integer
>>>>> value meaning the length of the input for 0-parameter UDFs. The
>>>>> return value should be a pandas.Series of the specified type, and its
>>>>> length should be the same as that of the input.
>>>>>
>>>>> We can define vectorized UDFs as:
>>>>>
>>>>>   @pandas_udf(DoubleType())
>>>>>   def plus(v1, v2):
>>>>>       return v1 + v2
>>>>>
>>>>> or we can define them as:
>>>>>
>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>
>>>>> We can use them similarly to row-by-row UDFs:
>>>>>
>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>
>>>>> As for 0-parameter UDFs, we can define and use them as:
>>>>>
>>>>>   @pandas_udf(LongType())
>>>>>   def f0(size):
>>>>>       return pd.Series(1).repeat(size)
>>>>>
>>>>>   df.select(f0())
>>>>>
>>>>>
>>>>>
>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following technical
>>>>> reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Takuya UESHIN
>>>>> Tokyo, Japan
>>>>>
>>>>> http://twitter.com/ueshin
>>>>>
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Takuya UESHIN
Yes, the aggregation is out of scope for now.
I think we should continue discussing the aggregation at JIRA and we will
be adding those later separately.

Thanks.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:

> Is the idea that aggregation is out of scope for the current effort and we
> will be adding it later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
> wrote:
>
>> Hi all,
>>
>> We've been discussing supporting vectorized UDFs in Python and we have
>> almost reached a consensus on the APIs, so I'd like to summarize and call
>> for a vote.
>>
>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>>
>> https://issues.apache.org/jira/browse/SPARK-21190
>>
>>
>> *Proposed API*
>>
>> We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which take one or more pandas.Series, or one integer
>> value meaning the length of the input for 0-parameter UDFs. The
>> return value should be a pandas.Series of the specified type, and its
>> length should be the same as that of the input.
>>
>> We can define vectorized UDFs as:
>>
>>   @pandas_udf(DoubleType())
>>   def plus(v1, v2):
>>       return v1 + v2
>>
>> or we can define them as:
>>
>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>
>> We can use them similarly to row-by-row UDFs:
>>
>>   df.withColumn('sum', plus(df.v1, df.v2))
>>
>> As for 0-parameter UDFs, we can define and use them as:
>>
>>   @pandas_udf(LongType())
>>   def f0(size):
>>       return pd.Series(1).repeat(size)
>>
>>   df.select(f0())
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


[VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Takuya UESHIN
Hi all,

We've been discussing supporting vectorized UDFs in Python and we have almost
reached a consensus on the APIs, so I'd like to summarize and call for a
vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for
vectorized UDAFs or Window operations.

https://issues.apache.org/jira/browse/SPARK-21190


*Proposed API*

We introduce a @pandas_udf decorator (or annotation) to define vectorized
UDFs which take one or more pandas.Series, or one integer value meaning the
length of the input for 0-parameter UDFs. The return value should be a
pandas.Series of the specified type, and its length should be the same as
that of the input.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      return v1 + v2

or we can define them as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use them similarly to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use them as:

  @pandas_udf(LongType())
  def f0(size):
      return pd.Series(1).repeat(size)

  df.select(f0())
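
As a quick illustration of how the proposed API would be exercised end to end,
here is a minimal, hypothetical sketch. It assumes the decorator is exposed
from pyspark.sql.functions once implemented and that pyarrow is installed; the
data and column names are made up, and the string-form type on the last line
is only a possible variant discussed in the thread, not part of this proposal:

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import pandas_udf
  from pyspark.sql.types import DoubleType

  spark = SparkSession.builder.getOrCreate()
  # Toy data; each column arrives inside the UDF as a pandas.Series batch.
  df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["v1", "v2"])

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      # v1 and v2 are pandas.Series of equal length.
      return v1 + v2

  df.withColumn('sum', plus(df.v1, df.v2)).show()

  # A string-form return type (e.g. "double") might also be accepted;
  # hypothetical here.
  plus2 = pandas_udf(lambda v1, v2: v1 + v2, "double")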



The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thanks!

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Takuya UESHIN
Congratulations, Jerry!


On Tue, Aug 29, 2017 at 2:14 PM, Suresh Thalamati <
suresh.thalam...@gmail.com> wrote:

> Congratulations, Jerry
>
> > On Aug 28, 2017, at 6:28 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> >
> > Hi everyone,
> >
> > The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
> has been contributing to many areas of the project for a long time, so it’s
> great to see him join. Join me in thanking and congratulating him!
> >
> > Matei
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> ---------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Takuya UESHIN
Congrats!

On Tue, Aug 8, 2017 at 11:38 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Congrats!!
>
> --
> *From:* Kevin Kim (Sangwoo) <ke...@between.us>
> *Sent:* Monday, August 7, 2017 7:30:01 PM
> *To:* Hyukjin Kwon; dev
> *Cc:* Bryan Cutler; Mridul Muralidharan; Matei Zaharia; Holden Karau
> *Subject:* Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
>
> Thanks for all of your hard work, Hyukjin and Sameer. Congratulations!!
>
>
> On Tue, Aug 8, 2017 at 9:44 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Thank you all. Will do my best!
>>
>> 2017-08-08 8:53 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>:
>>
>>> Congrats!
>>>
>>> On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler <cutl...@gmail.com> wrote:
>>>
>>>> Great work Hyukjin and Sameer!
>>>>
>>>> On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan <mri...@gmail.com>
>>>> wrote:
>>>>
>>>>> Congratulations Hyukjin, Sameer !
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia <matei.zaha...@gmail.com>
>>>>> wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal
>>>>> as committers. Join me in congratulating both of them and thanking them 
>>>>> for
>>>>> their contributions to the project!
>>>>> >
>>>>> > Matei
>>>>> > 
>>>>> -
>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Takuya UESHIN
Thank you very much everyone!
I really look forward to working with you!


On Tue, Feb 14, 2017 at 9:47 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> Congratulations!
>
> On Mon, Feb 13, 2017 at 3:29 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
> wrote:
>
>> Congrats!
>>
>> Kazuaki Ishizaki
>>
>>
>>
>> From: Reynold Xin <r...@databricks.com>
>> To: "dev@spark.apache.org" <dev@spark.apache.org>
>> Date: 2017/02/14 04:18
>> Subject: welcoming Takuya Ueshin as a new Apache Spark committer
>> --
>>
>>
>>
>> Hi all,
>>
>> Takuya-san has recently been elected an Apache Spark committer. He's been
>> active in the SQL area and writes very small, surgical patches that are
>> high quality. Please join me in congratulating Takuya-san!
>>
>>
>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Takuya UESHIN
-1 (non-binding)

I filed 2 major bugs against Spark SQL:

SPARK-15308 <https://issues.apache.org/jira/browse/SPARK-15308>: RowEncoder
should preserve nested column name.
SPARK-15313 <https://issues.apache.org/jira/browse/SPARK-15313>:
EmbedSerializerInFilter
rule should keep exprIds of output of surrounded SerializeFromObject.

I've sent PRs for those, please check them.

Thanks.




2016-05-18 14:40 GMT+09:00 Reynold Xin <r...@apache.org>:

> Hi,
>
> In the past the Apache Spark community have created preview packages (not
> official releases) and used those as opportunities to ask community members
> to test the upcoming versions of Apache Spark. Several people in the Apache
> community have suggested we conduct votes for these preview packages and
> turn them into formal releases by the Apache foundation's standard. Preview
> releases are not meant to be functional, i.e. they can and highly likely
> will contain critical bugs or documentation errors, but we will be able to
> post them to the project's website to get wider feedback. They should
> satisfy the legal requirements of Apache's release policy (
> http://www.apache.org/dev/release.html) such as having proper licenses.
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0-preview. The vote is open until Friday, May 20, 2016 at 11:00 PM PDT
> and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0-preview
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is 2.0.0-preview
> (8f5a04b6299e3a47aca13cbb40e72344c0114860)
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The documentation corresponding to this release can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
>
> The list of resolved issues are:
> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
>
>
> If you are a Spark user, you can help us test this release by taking an
> existing Apache Spark workload and running on this candidate, then
> reporting any regressions.
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: What is the correct Spark version of master/branch-1.0?

2014-06-04 Thread Takuya UESHIN
Thank you for your reply.

I've sent pull requests.


Thanks.


2014-06-05 3:16 GMT+09:00 Patrick Wendell pwend...@gmail.com:
 It should be 1.1-SNAPSHOT. Feel free to submit a PR to clean up any
 inconsistencies.

 On Tue, Jun 3, 2014 at 8:33 PM, Takuya UESHIN ues...@happy-camper.st wrote:
 Hi all,

 I'm wondering what the correct Spark version is for each HEAD of master
 and branch-1.0.

 current master HEAD (e8d93ee5284cb6a1d4551effe91ee8d233323329):
 - pom.xml: 1.0.0-SNAPSHOT
 - SparkBuild.scala: 1.1.0-SNAPSHOT

 It should be 1.1.0-SNAPSHOT?


 current branch-1.0 HEAD (d96794132e37cf57f8dd945b9d11f8adcfc30490):
 - pom.xml: 1.0.1-SNAPSHOT
 - SparkBuild.scala: 1.0.0

 It should be 1.0.1-SNAPSHOT?


 Thanks.

 --
 Takuya UESHIN
 Tokyo, Japan

 http://twitter.com/ueshin



-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


What is the correct Spark version of master/branch-1.0?

2014-06-03 Thread Takuya UESHIN
Hi all,

I'm wondering what the correct Spark version is for each HEAD of master
and branch-1.0.

current master HEAD (e8d93ee5284cb6a1d4551effe91ee8d233323329):
- pom.xml: 1.0.0-SNAPSHOT
- SparkBuild.scala: 1.1.0-SNAPSHOT

It should be 1.1.0-SNAPSHOT?


current branch-1.0 HEAD (d96794132e37cf57f8dd945b9d11f8adcfc30490):
- pom.xml: 1.0.1-SNAPSHOT
- SparkBuild.scala: 1.0.0

It should be 1.0.1-SNAPSHOT?


Thanks.

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin