Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-22 Thread Driesprong, Fokko
Thank you for running the release, Dongjoon.

+1

Tested against Iceberg and it looks good.


On Thu, Jun 22, 2023 at 18:03, yangjie01 wrote:

> +1
>
>
>
> *From:* Dongjoon Hyun 
> *Date:* Thursday, June 22, 2023 at 23:35
> *To:* Chao Sun 
> *Cc:* Yuming Wang , Jacek Laskowski ,
> dev 
> *Subject:* Re: [VOTE] Release Spark 3.4.1 (RC1)
>
>
>
> Thank you everyone for your participation.
>
> The vote is open until June 23rd 1AM (PST) and I'll conclude this vote
> after that.
>
> Dongjoon.
>
>
>
>
>
>
>
> On Thu, Jun 22, 2023 at 8:29 AM Chao Sun  wrote:
>
> +1
>
> On Thu, Jun 22, 2023 at 6:52 AM Yuming Wang  wrote:
> >
> > +1.
> >
> > On Thu, Jun 22, 2023 at 4:41 PM Jacek Laskowski  wrote:
> >>
> >> +1
> >>
> >> Builds and runs fine on Java 17, macOS.
> >>
> >> $ ./dev/change-scala-version.sh 2.13
> >> $ mvn \
> >>
> -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano,connect
> \
> >> -DskipTests \
> >> clean install
> >>
> >> $ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session
> SparkSession.sql'
> >> ...
> >> Tests passed in 28 seconds
> >>
> >> Regards,
> >> Jacek Laskowski
> >> 
> >> "The Internals Of" Online Books
> >> Follow me on https://twitter.com/jaceklaskowski
> 
> >>
> >>
> >>
> >> On Tue, Jun 20, 2023 at 4:41 AM Dongjoon Hyun 
> wrote:
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> version 3.4.1.
> >>>
> >>> The vote is open until June 23rd 1AM (PST) and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >>>
> >>> [ ] +1 Release this package as Apache Spark 3.4.1
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> >>>
> >>> The tag to be voted on is v3.4.1-rc1 (commit
> 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
> >>> https://github.com/apache/spark/tree/v3.4.1-rc1
> 
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
> 
> >>>
> >>> Signatures used for Spark RCs can be found in this file:
> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1443/
> 
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
> 
> >>>
> >>> The list of bug fixes going into 3.4.1 can be found at the following
> URL:
> >>> https://issues.apache.org/jira/projects/SPARK/versions/12352874
> 
> >>>
> >>> This release is using the release script of the tag v3.4.1-rc1.
> >>>
> >>> FAQ
> >>>
> >>> =
> >>> How can I help test this release?
> >>> =
> >>>
> >>> If you are a Spark user, you can help us test this release by taking
> >>> an existing Spark workload and running on this release candidate, then
> >>> reporting any regressions.
> >>>
> >>> If you're working in PySpark you can set up a virtual env and install
> >>> the current RC and see if anything important breaks, in the Java/Scala
> >>> you can add the staging repository to your projects resolvers and test
> >>> with the RC (make sure to clean up the artifact cache before/after so
> >>> you don't end up building with an out-of-date RC going forward).
> >>>
> >>> ===
> >>> What should happen to JIRA tickets still targeting 3.4.1?
> >>> ===
> >>>
> >>> The current list of open tickets targeted at 3.4.1 can be found at:
> >>> https://issues.apache.org/jira/projects/SPARK
> 
> and search for "Target Version/s" = 3.4.1
> >>>
> >>> Committers should look at those and triage. Extremely important bug
> >>> fixes, documentation, and API tweaks that impact compatibility should
> >>> be worked on immediately. 

Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Driesprong, Fokko
Well deserved all! Welcome!

On Fri, Mar 26, 2021 at 21:21, Matei Zaharia wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
>
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
>
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: I'm going to be out starting Nov 5th

2020-11-01 Thread Driesprong, Fokko
Hope everything goes well, and see you soon Holden! Take care and stay
strong!

Cheers, Fokko

On Sun, Nov 1, 2020 at 18:09, rahul kumar wrote:

>
> Come back strong and healthy Holden!
> On Sun, Nov 1, 2020 at 9:01 AM Holden Karau  wrote:
>
>> Thanks everyone, these kind words mean a lot :) I hope everyone is doing
>> as well as possible with our very challenging 2020. Community is what keeps
>> me going in open source :)
>>
>> On Sun, Nov 1, 2020 at 7:58 AM Xiao Li  wrote:
>>
>>> Take care, Holden!
>>>
>>> Bests,
>>>
>>> Xiao
>>>
>>> On Sat, Oct 31, 2020 at 9:53 PM 郑瑞峰  wrote:
>>>
 Take care, Holden! Best wishes!


 -- Original Message --
 *From:* "Hyukjin Kwon" ;
 *Sent:* Sunday, November 1, 2020, 10:24 AM
 *To:* "Denny Lee";
 *Cc:* "Dongjoon Hyun";"Holden Karau"<
 hol...@pigscanfly.ca>;"dev";
 *Subject:* Re: I'm going to be out starting Nov 5th

 Oh, take care Holden!

 On Sun, 1 Nov 2020, 03:04 Denny Lee,  wrote:

> Best wishes Holden! :)
>
> On Sat, Oct 31, 2020 at 11:00 Dongjoon Hyun 
> wrote:
>
>> Take care, Holden! I believe everything goes well.
>>
>> Bests,
>> Dongjoon.
>>
>> On Sat, Oct 31, 2020 at 10:24 AM Reynold Xin 
>> wrote:
>>
>>> Take care Holden and best of luck with everything!
>>>
>>>
>>> On Sat, Oct 31 2020 at 10:21 AM, Holden Karau 
>>> wrote:
>>>
 Hi Folks,

 Just a heads up so folks working on decommissioning or other areas
 I've been active in don't block on me, I'm going to be out for at 
 least a
 week and possibly more starting on November 5th. If there is anything 
 that
 folks want me to review before then please let me know and I'll make 
 the
 time for it. If you are curious I've got more details at
 http://blog.holdenkarau.com/2020/10/taking-break-surgery.html

 Happy Sparking Everyone,

 Holden :)

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Driesprong, Fokko
Looking at it a second time, I think it is only mypy that's complaining:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ git diff

diff --git a/python/pyspark/accumulators.pyi b/python/pyspark/accumulators.pyi
index f60de25704..6eafe46a46 100644
--- a/python/pyspark/accumulators.pyi
+++ b/python/pyspark/accumulators.pyi
@@ -30,7 +30,7 @@ U = TypeVar("U", bound=SupportsIAdd)

 import socketserver as SocketServer

-_accumulatorRegistry: Dict = {}
+# _accumulatorRegistry: Dict = {}

 class Accumulator(Generic[T]):
     aid: int


MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python

starting python compilation test...

python compilation succeeded.


starting pycodestyle test...

pycodestyle checks passed.


starting flake8 test...

flake8 checks passed.


starting mypy test...

mypy checks failed:

python/pyspark/worker.py:34: error: Module 'pyspark.accumulators' has no
attribute '_accumulatorRegistry'

Found 1 error in 1 file (checked 185 source files)

1


Sorry for the noise, just my excitement to see this happen. Are there any
action points that we can define that I can help with? I'm fine with taking
the route that Hyukjin suggests :)

Cheers, Fokko

On Thu, Aug 27, 2020 at 18:45, Maciej wrote:

> Well, technically speaking annotation and actual are not the same thing.
> Many parts of Spark API might require heavy overloads to either capture
> relationships between arguments (for example in case of ML) or to capture
> at least rudimentary relationships between inputs and outputs (i.e. udfs).
>
> Just saying...
>
>
>
> On 8/27/20 6:09 PM, Driesprong, Fokko wrote:
>
> Also, it is very cumbersome to add everything to the pyi file. In
> practice, this means copying the method definition from the py file and
> paste it into the pyi file. This hurts my developers' heart, as it
> violates the DRY principle.
>
>
>
> I see many big projects using regular annotations:
> - Pandas:
> https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L51
>
> That's probably not a good example, unless something changed significantly
> lately. The last time I participated in the discussion Pandas didn't type
> check and had no clear timeline for advertising annotations.
>
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>


Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Driesprong, Fokko
Thanks for sharing, Hyukjin. However, I'm not sure we're taking the right
direction.

Today I've updated [SPARK-17333][PYSPARK] Enable mypy on the repository
<https://github.com/apache/spark/pull/29180/> and while doing so I've
noticed that all the methods that aren't in the pyi file are *unable to be
called from other Python files*. I was unaware of this effect of the pyi
files. As soon as you create the files, all the methods are shielded from
external access. Feels like going back to C++ :'(

With the current stubs as-is, I already had to add a few public methods to
the serializers.pyi
<https://github.com/zero323/pyspark-stubs/pull/464/files#diff-92c0b8f614297ea5a12f0491ccb4b316>,
not to mention the private methods in the utils.pyi
<https://github.com/zero323/pyspark-stubs/pull/464/files#diff-9dc1ff7de58fe85eead4416952e78b2e>.
This made me nervous: it is easy to forget methods that are solely being
used by the external API. If we forget or miss a function, then that
functionality will be unusable by the end user. Of course, this can be
caught by adding tests that cover the full public API, but I think
this is very error-prone.
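
To make the effect concrete, a minimal hypothetical sketch (the module and
function names below are made up, they are not actual PySpark files): once a
stub exists, mypy only sees what the .pyi declares, so anything omitted from
it looks missing to external callers, even though it is there at runtime.

# utilities.py -- the implementation (hypothetical)
def public_helper(x: int) -> int:
    return x + 1

def forgotten_helper(x: int) -> int:  # exists at runtime, but not in the stub
    return x * 2

# utilities.pyi -- the stub; only one function was remembered
def public_helper(x: int) -> int: ...

# caller.py -- mypy checks against the stub, so this import is flagged:
# error: Module 'utilities' has no attribute 'forgotten_helper'
from utilities import public_helper, forgotten_helper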

Also, it is very cumbersome to add everything to the pyi file. In practice,
this means copying the method definition from the py file and pasting it
into the pyi file. This hurts my developer's heart, as it violates the DRY
principle. Just adding the signatures to the py files makes much more sense
to me, as we don't have to copy all the signatures to a separate file, and
we don't forget to make methods public. Not to mention, it would
potentially break most of the open PRs to the PySpark codebase.
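
As a small illustration of that duplication (the file and function names
are hypothetical, just to show the shape of the problem): with stubs, every
signature lives twice, while inline it lives once, next to the implementation.

# With a separate stub, the signature is written in two places:
#
#   column_utils.py    def with_suffix(name, suffix): return f"{name}_{suffix}"
#   column_utils.pyi   def with_suffix(name: str, suffix: str) -> str: ...
#
# With inline annotations, the same information lives in one place:
def with_suffix(name: str, suffix: str) -> str:
    """Append a suffix to a column name."""
    return f"{name}_{suffix}"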

I see many big projects using regular annotations:
- Pandas:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L51
- Koalas:
https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L469
- Apache Airflow:
https://github.com/apache/airflow/blob/master/airflow/executors/celery_executor.py#L278
- Data Build Tool:
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/deps/registry.py#L74

Now knowing the effect of the pyi file, I'm still strongly in favor of moving
the type definitions inline:

   - This will make the adoption of the type definitions much smoother,
   since we can do it in smaller iterations, instead of per file (yes, I'm
   looking at you, features.py
   <https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py>,
   5kloc+)
   - Not having to worry that we forgot a function and potentially screw up
   a release
   - Ability to add type hints to both public and private methods

Can we add https://issues.apache.org/jira/browse/SPARK-17333 as a sub
ticket of SPARK-32681 <https://issues.apache.org/jira/browse/SPARK-32681>?

Cheers, Fokko




On Thu, Aug 27, 2020 at 13:43, Hyukjin Kwon wrote:

> Okay, it took me a while because I had to check the options and
> feasibility we discussed here.
>
> TL;DR: I think we can just port directly pyi files as are into PySpark
> main repository.
>
> I would like to share only the key points here because it looks like I,
> Maciej and people here agree with this direction.
>
> - The stability in PySpark stubs seems pretty okay enough to port directly
> into the main repository.
> At least it covers the most of user-facing APIs. So there won't be
> many advantages by running it separately, (vs the overhead to make a repo
> and maintain separately)
> - There's a possibility that the type hinting way can be changed
> drastically but it will be manageable given that it will be handled within
> the same pyi files.
> - We'll need some tests for that.
> - We'll make sure there's no external user app breakage by this.
>
> There will likely be some other meta works such as adding tests and/or
> documentation works. So I filed an umbrella JIRA for that SPARK-32681
> <https://issues.apache.org/jira/browse/SPARK-32681>.
> If there's no objections in this direction, I think hopefully we can
> start. Let me know if you guys have thoughts on this.
>
> Thanks!
>
>
>
> On Thu, Aug 20, 2020 at 8:39 PM, Driesprong, Fokko wrote:
>
>> No worries, thanks for the update!
>>
>> On Thu, Aug 20, 2020 at 12:50, Hyukjin Kwon wrote:
>>
>>> Yeah, we had a short meeting. I had to check a few other things so some
>>> delays happened. I will share soon.
>>>
>>> On Thu, Aug 20, 2020 at 7:14 PM, Driesprong, Fokko wrote:
>>>
>>>> Hi Maciej, Hyukjin,
>>>>
>>>> Did you find any time to discuss adding the types to the Python
>>>> repository? Would love to know what came out of it.
>>>>
>>>> Cheers, Fokko
>>>>
>>>> On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko wrote:
>>>>
>>>

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Driesprong, Fokko
No worries, thanks for the update!

On Thu, Aug 20, 2020 at 12:50, Hyukjin Kwon wrote:

> Yeah, we had a short meeting. I had to check a few other things so some
> delays happened. I will share soon.
>
> On Thu, Aug 20, 2020 at 7:14 PM, Driesprong, Fokko wrote:
>
>> Hi Maciej, Hyukjin,
>>
>> Did you find any time to discuss adding the types to the Python
>> repository? Would love to know what came out of it.
>>
>> Cheers, Fokko
>>
>> On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko wrote:
>>
>>> Mostly echoing stuff that we've discussed in
>>> https://github.com/apache/spark/pull/29180, but good to have this also
>>> on the dev-list.
>>>
>>> > So IMO maintaining outside in a separate repo is going to be harder.
>>> That was why I asked.
>>>
>>> I agree with Felix, having this inside of the project would make it much
>>> easier to maintain. Having it inside of the ASF might be easier to port the
>>> pyi files to the actual Spark repository.
>>>
>>> > FWIW, NumPy took this approach. they made a separate repo, and merged
>>> it into the main repo after it became stable.
>>>
>>> As Maciej pointed out:
>>>
>>> > As of POC ‒ we have stubs, which have been maintained over three years
>>> now and cover versions between 2.3 (though these are fairly limited) to,
>>> with some lag, current master.
>>>
>>> What would be required to mark it as stable?
>>>
>>> > I guess all depends on how we envision the future of annotations
>>> (including, but not limited to, how conservative we want to be in the
>>> future). Which is probably something that should be discussed here.
>>>
>>> I'm happy to motivate people to contribute type hints, and I believe it
>>> is a very accessible way to get more people involved in the Python
>>> codebase. Using the ASF model we can ensure that we require committers/PMC
>>> to sign off on the annotations.
>>>
>>> > Indeed, though the possible advantage is that in theory, you can have
>>> different release cycle than for the main repo (I am not sure if that's
>>> feasible in practice or if that was the intention).
>>>
>>> Personally, I don't think we need a different cycle if the type
>>> hints are part of the code itself.
>>>
>>> > If my understanding is correct, pyspark-stubs is still incomplete and
>>> does not annotate types in some other APIs (by using Any). Correct me if I
>>> am wrong, Maciej.
>>>
>>> For me, it is a bit like code coverage. You want this to be high to make
>>> sure that you cover most of the APIs, but it will take some time to make it
>>> complete.
>>>
>>> For me, it feels a bit like a chicken and egg problem. Because the type
>>> hints are in a separate repository, they will always lag behind. Also, it
>>> is harder to spot where the gaps are.
>>>
>>> Cheers, Fokko
>>>
>>>
>>>
>>> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon wrote:
>>>
>>>> Oh I think I caused some confusion here.
>>>> Just for clarification, I wasn’t saying we must port this into a
>>>> separate repo now. I was saying it can be one of the options we can
>>>> consider.
>>>>
>>>>
>>>> For a bit of more context:
>>>> This option was considered as, roughly speaking, an invalid option and
>>>> it might need an incubation process as a separate project.
>>>> After some investigations, I found that this is still a valid option
>>>> and we can take this as the part of Apache Spark but in a separate repo.
>>>>
>>>>
>>>> FWIW, NumPy took this approach. they made a separate repo
>>>> <https://github.com/numpy/numpy-stubs>, and merged it into the main
>>>> repo <https://github.com/numpy/numpy-stubs> after it became stable.
>>>>
>>>>
>>>>
>>>> My only major concerns are:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>- the possibility to fundamentally change the approach in
>>>>pyspark-stubs <https://github.com/zero323/pyspark-stubs>. It’s not
>>>>because how it was done is wrong but because how Python type hinting 
>>>> itself
>>>>evolves.
>>>>
>>>>- If my understanding is correct, pyspark-stubs
>>>><https://g

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Driesprong, Fokko
Hi Maciej, Hyukjin,

Did you find any time to discuss adding the types to the Python repository?
Would love to know what came out of it.

Cheers, Fokko

On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko wrote:

> Mostly echoing stuff that we've discussed in
> https://github.com/apache/spark/pull/29180, but good to have this also on
> the dev-list.
>
> > So IMO maintaining outside in a separate repo is going to be harder.
> That was why I asked.
>
> I agree with Felix, having this inside of the project would make it much
> easier to maintain. Having it inside of the ASF might be easier to port the
> pyi files to the actual Spark repository.
>
> > FWIW, NumPy took this approach. they made a separate repo, and merged it
> into the main repo after it became stable.
>
> As Maciej pointed out:
>
> > As of POC ‒ we have stubs, which have been maintained over three years
> now and cover versions between 2.3 (though these are fairly limited) to,
> with some lag, current master.
>
> What would be required to mark it as stable?
>
> > I guess all depends on how we envision the future of annotations
> (including, but not limited to, how conservative we want to be in the
> future). Which is probably something that should be discussed here.
>
> I'm happy to motivate people to contribute type hints, and I believe it is
> a very accessible way to get more people involved in the Python codebase.
> Using the ASF model we can ensure that we require committers/PMC to sign
> off on the annotations.
>
> > Indeed, though the possible advantage is that in theory, you can have
> different release cycle than for the main repo (I am not sure if that's
> feasible in practice or if that was the intention).
>
> Personally, I don't think we need a different cycle if the type hints are
> part of the code itself.
>
> > If my understanding is correct, pyspark-stubs is still incomplete and
> does not annotate types in some other APIs (by using Any). Correct me if I
> am wrong, Maciej.
>
> For me, it is a bit like code coverage. You want this to be high to make
> sure that you cover most of the APIs, but it will take some time to make it
> complete.
>
> For me, it feels a bit like a chicken and egg problem. Because the type
> hints are in a separate repository, they will always lag behind. Also, it
> is harder to spot where the gaps are.
>
> Cheers, Fokko
>
>
>
> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon wrote:
>
>> Oh I think I caused some confusion here.
>> Just for clarification, I wasn’t saying we must port this into a separate
>> repo now. I was saying it can be one of the options we can consider.
>>
>> For a bit of more context:
>> This option was considered as, roughly speaking, an invalid option and it
>> might need an incubation process as a separate project.
>> After some investigations, I found that this is still a valid option and
>> we can take this as the part of Apache Spark but in a separate repo.
>>
>> FWIW, NumPy took this approach. they made a separate repo
>> <https://github.com/numpy/numpy-stubs>, and merged it into the main repo
>> <https://github.com/numpy/numpy-stubs> after it became stable.
>>
>>
>> My only major concerns are:
>>
>>- the possibility to fundamentally change the approach in
>>pyspark-stubs <https://github.com/zero323/pyspark-stubs>. It’s not
>>because how it was done is wrong but because how Python type hinting 
>> itself
>>evolves.
>>- If my understanding is correct, pyspark-stubs
>><https://github.com/zero323/pyspark-stubs> is still incomplete and
>>does not annotate types in some other APIs (by using Any). Correct me if I
>>am wrong, Maciej.
>>
>> I’ll have a short sync with him and share to understand better since he’d
>> probably know the context best in PySpark type hints and I know some
>> contexts in ASF and Apache Spark.
>>
>>
>>
>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz wrote:
>>
>>> Indeed, though the possible advantage is that in theory, you can have
>>> different release cycle than for the main repo (I am not sure if that's
>>> feasible in practice or if that was the intention).
>>>
>>> I guess all depends on how we envision the future of annotations
>>> (including, but not limited to, how conservative we want to be in the
>>> future). Which is probably something that should be discussed here.
>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>
>>> So IMO maintaining outside in a separate repo is going to be harder.
>>> That was why I 

Allow average out of a Date

2020-08-19 Thread Driesprong, Fokko
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics
of a dataframe. I often use this for debugging pipelines, and to check what
the impact on the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there
is a list of types that are allowed to be included in the summary, and I'd
love to extend that list.

The first one is the date type. It is important to define this together
with the community, and that we get consensus, as this will be part of the
public API. Changing it later will be costly (or maybe impossible) to do.
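
As a minimal sketch of the current behaviour (the exact set of statistics
depends on the Spark version, so take this as illustrative): the numeric
column shows up in .summary(), while the date column is left out, which is
the gap I'd like to close.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2020-01-01"), (3, "2020-01-03")], ["amount", "day"]
).withColumn("day", F.to_date("day"))

# 'amount' gets count/mean/stddev/min/max, 'day' is not summarized
df.summary().show()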

I've checked what other DBMSes do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
   ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
   ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
 date_part
-----------
     18422
(1 row)


MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)

However, converting to an int isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'bigint)' at line 1

Bigquery

Unsupported:

[screenshot of the BigQuery error omitted]

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as
it is being used as a physical format internally in many data formats.

[screenshot of the Excel result omitted]

For me, returning a Date as the output of avg(date) seems like a logical
choice. Internally it is handled as days since epoch, which makes sense as
well:

*Avro* it is milliseconds since epoch:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java

*Parquet* it is days since epoch:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date

*ORC* is based around days since Epoch:
https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java

Also with this, we keep parity with the Catalyst type :)
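
In the meantime, the proposed semantics can be approximated by hand; a
minimal PySpark sketch (going through the days-since-epoch representation
explicitly, since, as far as I can tell, avg() does not accept a date column
directly today):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2020-01-01",), ("2020-01-03",)], ["d"]) \
    .select(F.to_date("d").alias("d"))

# Average over days since epoch, then cast the result back to a date.
avg_days = df.select(
    F.avg(F.datediff(F.col("d"), F.lit("1970-01-01"))).alias("avg_days")
)
avg_days.selectExpr(
    "date_add(date '1970-01-01', cast(avg_days as int)) AS avg_date"
).show()  # 2020-01-02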

Any further thoughts on this before moving forward?

Kind regards, Fokko


Re: [PySpark] Revisiting PySpark type annotations

2020-08-05 Thread Driesprong, Fokko
Mostly echoing stuff that we've discussed in
https://github.com/apache/spark/pull/29180, but good to have this also on
the dev-list.

> So IMO maintaining outside in a separate repo is going to be harder. That
was why I asked.

I agree with Felix, having this inside of the project would make it much
easier to maintain. Having it inside of the ASF might be easier to port the
pyi files to the actual Spark repository.

> FWIW, NumPy took this approach. they made a separate repo, and merged it
into the main repo after it became stable.

As Maciej pointed out:

> As of POC ‒ we have stubs, which have been maintained over three years
now and cover versions between 2.3 (though these are fairly limited) to,
with some lag, current master.

What would be required to mark it as stable?

> I guess all depends on how we envision the future of annotations
(including, but not limited to, how conservative we want to be in the
future). Which is probably something that should be discussed here.

I'm happy to motivate people to contribute type hints, and I believe it is
a very accessible way to get more people involved in the Python codebase.
Using the ASF model we can ensure that we require committers/PMC to sign
off on the annotations.

> Indeed, though the possible advantage is that in theory, you can have
different release cycle than for the main repo (I am not sure if that's
feasible in practice or if that was the intention).

Personally, I don't think we need a different cycle if the type hints are
part of the code itself.

> If my understanding is correct, pyspark-stubs is still incomplete and
does not annotate types in some other APIs (by using Any). Correct me if I
am wrong, Maciej.

For me, it is a bit like code coverage. You want this to be high to make
sure that you cover most of the APIs, but it will take some time to make it
complete.

For me, it feels a bit like a chicken and egg problem. Because the type
hints are in a separate repository, they will always lag behind. Also, it
is harder to spot where the gaps are.

Cheers, Fokko



On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon wrote:

> Oh I think I caused some confusion here.
> Just for clarification, I wasn’t saying we must port this into a separate
> repo now. I was saying it can be one of the options we can consider.
>
> For a bit of more context:
> This option was considered as, roughly speaking, an invalid option and it
> might need an incubation process as a separate project.
> After some investigations, I found that this is still a valid option and
> we can take this as the part of Apache Spark but in a separate repo.
>
> FWIW, NumPy took this approach. they made a separate repo
> <https://github.com/numpy/numpy-stubs>, and merged it into the main repo
> <https://github.com/numpy/numpy-stubs> after it became stable.
>
>
> My only major concerns are:
>
>- the possibility to fundamentally change the approach in pyspark-stubs
><https://github.com/zero323/pyspark-stubs>. It’s not because how it
>was done is wrong but because how Python type hinting itself evolves.
>- If my understanding is correct, pyspark-stubs
><https://github.com/zero323/pyspark-stubs> is still incomplete and
>does not annotate types in some other APIs (by using Any). Correct me if I
>am wrong, Maciej.
>
> I’ll have a short sync with him and share to understand better since he’d
> probably know the context best in PySpark type hints and I know some
> contexts in ASF and Apache Spark.
>
>
>
> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz wrote:
>
>> Indeed, though the possible advantage is that in theory, you can have
>> different release cycle than for the main repo (I am not sure if that's
>> feasible in practice or if that was the intention).
>>
>> I guess all depends on how we envision the future of annotations
>> (including, but not limited to, how conservative we want to be in the
>> future). Which is probably something that should be discussed here.
>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>
>> So IMO maintaining outside in a separate repo is going to be harder. That
>> was why I asked.
>>
>>
>>
>> --
>> *From:* Maciej Szymkiewicz 
>> 
>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>> *To:* Sean Owen
>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark
>> Dev List
>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>
>>
>> On 8/4/20 9:35 PM, Sean Owen wrote
>> > Yes, but the general argument you make here is: if you tie this
>> > project to the main project, it will _have_ to be maintained by
>> > everyone. That's good, but also exactly I think the downside we want
>> 

Re: [PySpark] Revisiting PySpark type annotations

2020-08-03 Thread Driesprong, Fokko
Cool stuff! Moving it to the ASF would be a great first step.

I think you might want to check the IP Clearance template:
http://incubator.apache.org/ip-clearance/ip-clearance-template.html

This is the one being used when donating the Airflow Kubernetes operator
from Google to the ASF:
http://mail-archives.apache.org/mod_mbox/airflow-dev/201909.mbox/%3cca+aakm-ahq7wni6+nazfnrxfnfh1wy34gcvyavsq4xlcwh2...@mail.gmail.com%3e

I don't expect anything weird, but it might be a good idea to check if the
licenses are in the files:
https://github.com/zero323/pyspark-stubs/pull/458 And
check if there are any dependencies with licenses that are in conflict with
the Apache 2.0 license, but it looks good to me.

Looking forward, are we going to keep this as a separate repository? While
adding the licenses I've noticed that there is a lingering annotation:
https://github.com/zero323/pyspark-stubs/pull/459 This file has been
removed in Spark upstream because we've bumped the Python version. As
mentioned in the Pull Request earlier, I would be a big fan of putting the
annotations and the code in the same repository. I'm fine with keeping them
separate in a .pyi as well. Otherwise, it is very easy for them to run out
of sync.

Please let me know what comes out of the meeting.

Cheers, Fokko

On Mon, Aug 3, 2020 at 10:59, Hyukjin Kwon wrote:

> Okay, seems like we can create a separate repo as apache/spark? e.g.)
> https://issues.apache.org/jira/browse/INFRA-20470
> We can also think about porting the files as are.
> I will try to have a short sync with the author Maciej, and share what we
> discussed offline.
>
>
> On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz wrote:
>
>>
>>
>> On Wednesday, July 22, 2020, Driesprong, Fokko wrote:
>>
>>> That's probably one-time overhead so it is not a big issue.  In my
>>> opinion, a bigger one is possible complexity. Annotations tend to introduce
>>> a lot of cyclic dependencies in Spark codebase. This can be addressed, but
>>> don't look great.
>>>
>>>
>>> This is not true (anymore). With Python 3.6 you can add string
>>> annotations -> 'DenseVector', and in the future with Python 3.7 this is
>>> fixed by having postponed evaluation:
>>> https://www.python.org/dev/peps/pep-0563/
>>>
>>
>> As far as I recall linked PEP addresses backrferences not cyclic
>> dependencies, which weren't a big issue in the first place
>>
>> What I mean is a actually cyclic stuff - for example pyspark.context
>> depends on pyspark.rdd and the other way around. These dependencies are not
>> explicit at he moment.
>>
>>
>>
>>> Merging stubs into project structure from the other hand has almost no
>>> overhead.
>>>
>>>
>>> This feels awkward to me, this is like having the docstring in a
>>> separate file. In my opinion you want to have the signatures and the
>>> functions together for transparency and maintainability.
>>>
>>>
>> I guess that's the matter of preference. From maintainability perspective
>> it is actually much easier to have separate objects.
>>
>> For example there are different types of objects that are required for
>> meaningful checking, which don't really exist in real code (protocols,
>> aliases, code generated signatures fo let complex overloads) as well as
>> some monkey patched entities
>>
>> Additionally it is often easier to see inconsistencies when typing is
>> separate.
>>
>> However, I am not implying that this should be a persistent state.
>>
>> In general I see two non breaking paths here.
>>
>>  - Merge pyspark-stubs a separate subproject within main spark repo and
>> keep it in-sync there with common CI pipeline and transfer ownership of
>> pypi package to ASF
>> - Move stubs directly into python/pyspark and then apply individual stubs
>> to .modules of choice.
>>
>> Of course, the first proposal could be an initial step for the latter one.
>>
>>
>>>
>>> I think DBT is a very nice project where they use annotations very well:
>>> https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py
>>>
>>> Also, they left out the types in the docstring, since they are available
>>> in the annotations itself.
>>>
>>>
>>
>>> In practice, the biggest advantage is actually support for completion,
>>> not type checking (which works in simple cases).
>>>
>>>
>>> Agreed.
>>>
>>> Would you be interested in writing up the Outreachy proposal for work on
>>> this?
>>

Re: Python xmlrunner being used?

2020-07-24 Thread Driesprong, Fokko
I found this ticket: https://issues.apache.org/jira/browse/SPARK-7021

Is anybody actually using this?

Cheers, Fokko

On Fri, Jul 24, 2020 at 16:27, Driesprong, Fokko wrote:

> Hi all,
>
> Does anyone know if the xmlrunner package is still being used?
>
> We're working on enforcing some static code analysis checks on the Python
> codebase, and the imports of the xmlrunner generates quite some noise:
> https://github.com/apache/spark/pull/29121
>
> It looks like the entry point for a lot of tests:
> https://github.com/apache/spark/search?p=1=xmlrunner_q=xmlrunner 
> This
> will only run when the testfile is explicitly invoked.
>
> However, looking at the coverage report generation:
> https://github.com/apache/spark/blob/master/python/run-tests-with-coverage 
> This
> is being generated using coverage.
>
> I also can't find where it is being installed. Anyone any
> historical knowledge on this?
>
> Kind regards, Fokko
>
>
>
>
>


Python xmlrunner being used?

2020-07-24 Thread Driesprong, Fokko
Hi all,

Does anyone know if the xmlrunner package is still being used?

We're working on enforcing some static code analysis checks on the Python
codebase, and the imports of xmlrunner generate quite some noise:
https://github.com/apache/spark/pull/29121

It looks like the entry point for a lot of tests:
https://github.com/apache/spark/search?p=1=xmlrunner_q=xmlrunner
This
will only run when the testfile is explicitly invoked.

However, looking at the coverage report generation:
https://github.com/apache/spark/blob/master/python/run-tests-with-coverage This
is being generated using coverage.

I also can't find where it is being installed. Does anyone have any
historical knowledge on this?

Kind regards, Fokko


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Driesprong, Fokko
That's probably one-time overhead so it is not a big issue.  In my opinion,
a bigger one is possible complexity. Annotations tend to introduce a lot of
cyclic dependencies in Spark codebase. This can be addressed, but don't
look great.


This is not true (anymore). With Python 3.6 you can add string annotations
-> 'DenseVector', and in the future with Python 3.7 this is fixed by having
postponed evaluation: https://www.python.org/dev/peps/pep-0563/
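
A small sketch of what this looks like in practice (the module layout below
is hypothetical, it just mirrors the pyspark.context / pyspark.rdd shape of
the problem): the import only happens for the type checker, and the
annotation itself stays a string, so there is no runtime cycle.

# context.py -- hypothetical module that would otherwise import rdd at runtime
from __future__ import annotations  # PEP 563: postponed evaluation (3.7+)

from typing import TYPE_CHECKING

if TYPE_CHECKING:        # only evaluated by mypy, never at runtime
    from rdd import RDD  # hypothetical counterpart module

class Context:
    def make_rdd(self, data: list) -> RDD:  # on 3.6 you'd write -> "RDD"
        ...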

Merging stubs into project structure from the other hand has almost no
overhead.


This feels awkward to me; it is like having the docstring in a separate
file. In my opinion, you want to have the signatures and the functions
together for transparency and maintainability.

I think DBT is a very nice project where they use annotations very well:
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstring, since they are available in
the annotations themselves.

In practice, the biggest advantage is actually support for completion, not
type checking (which works in simple cases).


Agreed.

Would you be interested in writing up the Outreachy proposal for work on
this?


I would be, and I'm also happy to mentor. But I think we first need to agree
as a Spark community on whether we want to add the annotations to the code,
and to what extent.

At some point (in general when things are heavy in generics, which is the
case here), annotations become somewhat painful to write.


That's true, but that might also be a pointer that it is time to refactor
the function/code :)

For now, I tend to think adding type hints to the codes make it difficult
to backport or revert and more difficult to discuss about typing only
especially considering typing is arguably premature yet.


This feels a bit weird to me, since you want to keep this in sync right? Do
you provide different stubs for different versions of Python? I had to look
up the literals: https://www.python.org/dev/peps/pep-0586/

Cheers, Fokko

On Wed, Jul 22, 2020 at 09:40, Maciej Szymkiewicz <mszymkiew...@gmail.com>
wrote:

>
> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> > For now, I tend to think adding type hints to the codes make it
> > difficult to backport or revert and
> > more difficult to discuss about typing only especially considering
> > typing is arguably premature yet.
>
> About being premature ‒ since typing ecosystem evolves much faster than
> Spark it might be preferable to keep annotations as a separate project
> (preferably under ASF / Spark umbrella). It allows for faster iterations
> and supporting new features (for example Literals proved to be very
> useful), without waiting for the next Spark release.
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
>


Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Fully agree, Holden. It would be great to include the Outreachy project.
Adding annotations is a very friendly way to get familiar with the codebase.

I've also created a PR to see what's needed to get mypy in:
https://github.com/apache/spark/pull/29180 From there on we can start
adding annotations.

Cheers, Fokko


On Tue, Jul 21, 2020 at 21:40, Holden Karau wrote:

> Yeah I think this could be a great project now that we're only Python
> 3.5+. One potential is making this an Outreachy project to get more folks
> from different backgrounds involved in Spark.
>
> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko 
> wrote:
>
>> Since we've recently dropped support for Python <=3.5
>> <https://github.com/apache/spark/pull/28957>, I think it would be nice
>> to add support for type annotations. Having this in the main repository
>> allows us to do type checking using MyPy <http://mypy-lang.org/> in the
>> CI itself. <http://mypy-lang.org/>
>>
>> This is now handled by the Stub file:
>> https://www.python.org/dev/peps/pep-0484/#stub-files However I think it
>> is nicer to integrate the types with the code itself to keep everything in
>> sync, and make it easier for the people who work on the codebase itself. A
>> first step would be to move the stubs into the codebase. First step would
>> be to cover the public API which is the most important one. Having the
>> types with the code itself makes it much easier to understand. For example,
>> if you can supply a str or column here:
>> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>>
>> One of the implications would be that future PR's on Python should cover
>> annotations on the public API's. Curious what the rest of the community
>> thinks.
>>
>> Cheers, Fokko
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jul 21, 2020 at 20:04, zero323 wrote:
>>
>>> Given a discussion related to  SPARK-32320 PR
>>> <https://github.com/apache/spark/pull/29122>   I'd like to resurrect
>>> this
>>> thread. Is there any interest in migrating annotations to the main
>>> repository?
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Since we've recently dropped support for Python <=3.5
<https://github.com/apache/spark/pull/28957>, I think it would be nice to
add support for type annotations. Having this in the main repository allows
us to do type checking using MyPy <http://mypy-lang.org/> in the CI itself.


This is now handled by the stub file:
https://www.python.org/dev/peps/pep-0484/#stub-files However, I think it is
nicer to integrate the types with the code itself to keep everything in
sync, and to make it easier for the people who work on the codebase itself.
A first step would be to move the stubs into the codebase, starting with the
public API, which is the most important part. Having the types with the code
itself makes it much easier to understand, for example, whether you can
supply a str or a Column here:
https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
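
As a hedged sketch of what such an inline annotation could look like (the
function and the Column stand-in below are illustrative, not the actual
signature from that PR):

from typing import Union

class Column:
    """Stand-in for pyspark.sql.Column, just to keep the sketch self-contained."""

def upper_col(col: Union[Column, str]) -> Column:
    """Uppercase a string column; accepts either a Column or a column name."""
    ...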

One of the implications would be that future PRs on Python should cover
annotations on the public APIs. Curious what the rest of the community
thinks.

Cheers, Fokko









On Tue, Jul 21, 2020 at 20:04, zero323 wrote:

> Given a discussion related to  SPARK-32320 PR
>    I'd like to resurrect this
> thread. Is there any interest in migrating annotations to the main
> repository?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Driesprong, Fokko
+1 I'm in favor of using python3

Cheers, Fokko

On Fri, Jul 17, 2020 at 19:49, Sean Owen wrote:

> Yeah I figured it's a best practice, so I'll raise a PR unless
> somebody tells me not to. This is about build scripts, not Pyspark
> itself, and half the scripts already specify python3.
>
> On Fri, Jul 17, 2020 at 12:36 PM Oli McCormack  wrote:
> >
> > [Warning: not spark+python specific information]
> >
> > It's recommended that you should explicitly call out python3 in a case
> like this (see PEP-0394, and SO). Your environment is typical: python is
> often a pointer to python2 for tooling compatibility reasons (other tools
> or scripts that expect they're going to get python2 when they call python),
> and you should use python3 to use the new version. What python points to
> will change over time, so it's recommended to use python2 if explicitly
> depending on that.
> >
> > More generally: It's common/recommended to use a virtual environment +
> explicitly stated versions of Python and dependencies, rather than system
> Python, so that python means exactly what you intend it to. I know very
> little about the Spark python dev stack and how challenging it may be to do
> this, so please take this with a dose of naiveté.
> >
> > - Oli
> >
> >
> > On Fri, Jul 17, 2020 at 9:58 AM Sean Owen  wrote:
> >>
> >> So, we are on Python 3 entirely now right?
> >> It might be just my local Mac env, but "/usr/bin/env python" uses
> >> Python 2 on my mac.
> >> Some scripts write "/usr/bin/env python3" now. Should that be the case
> >> in all scripts?
> >> Right now the merge script doesn't work for me b/c it was just updated
> >> to be Python 3 only.
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Driesprong, Fokko
Welcome!

On Tue, Jul 14, 2020 at 19:53, shane knapp ☠ wrote:

> welcome, all!
>
> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers. Please join
>> me in welcoming them to their new roles! The new committers are:
>>
>> - Huaxin Gao
>> - Jungtaek Lim
>> - Dilip Biswal
>>
>> All three of them contributed to Spark 3.0 and we’re excited to have them
>> join the project.
>>
>> Matei and the Spark PMC
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Decommissioning SPIP

2020-07-03 Thread Driesprong, Fokko
+1 (non-binding)


Cheers, Fokko

On Fri, Jul 3, 2020 at 09:16, Xin Jinhan <18183124...@163.com> wrote:

> +1
> this really make sense!!
>
> Regards,
> Jinhan
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Driesprong, Fokko
Hi Bing,

Good question, and the answer is: it depends on what your use case is.

If you really just want to write raw bytes, then you could create a
.foreach where you open an OutputStream and write it to some file. But this
is probably not what you want, and in practice not very handy since you
want to keep the records.

My suggestion would be to write it as Parquet or Avro, and write it to a
binary field. With Avro you have the bytes primitive which converts in
Spark to Array[Byte]: https://avro.apache.org/docs/1.9.1/spec.html Similar
to Parquet where you have the BYTE_ARRAY:
https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0

In the words of Linus Torvalds: *Talk is cheap, show me the code*:

MacBook-Pro-van-Fokko:~ fokkodriesprong$ spark-shell
20/01/16 10:58:44 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://172.20.10.3:4040
Spark context available as 'sc' (master = local[*], app id =
local-1579168731763).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data: Array[Array[Byte]] = Array(
 |   Array(0x19.toByte, 0x25.toByte)
 | )
data: Array[Array[Byte]] = Array(Array(25, 37))

scala> val rdd = sc.parallelize(data, 1);
rdd: org.apache.spark.rdd.RDD[Array[Byte]] = ParallelCollectionRDD[0] at
parallelize at :26

scala> rdd.toDF("byte")
res1: org.apache.spark.sql.DataFrame = [byte: binary]

scala> val df = rdd.toDF("byte")
df: org.apache.spark.sql.DataFrame = [byte: binary]

scala> df.write.parquet("/tmp/bytes/")



MacBook-Pro-van-Fokko:~ fokkodriesprong$ ls -lah /tmp/bytes/
total 24
drwxr-xr-x   6 fokkodriesprong  wheel   192B 16 jan 11:01 .
drwxrwxrwt  16 root wheel   512B 16 jan 11:01 ..
-rw-r--r--   1 fokkodriesprong  wheel 8B 16 jan 11:01 ._SUCCESS.crc
-rw-r--r--   1 fokkodriesprong  wheel12B 16 jan 11:01
.part-0-d0d684bb-2371-4947-b2f3-6fca4ead69a7-c000.snappy.parquet.crc
-rw-r--r--   1 fokkodriesprong  wheel 0B 16 jan 11:01 _SUCCESS
-rw-r--r--   1 fokkodriesprong  wheel   384B 16 jan 11:01
part-0-d0d684bb-2371-4947-b2f3-6fca4ead69a7-c000.snappy.parquet

MacBook-Pro-van-Fokko:~ fokkodriesprong$ parquet-tools schema
/tmp/bytes/part-0-d0d684bb-2371-4947-b2f3-6fca4ead69a7-c000.snappy.parquet

message spark_schema {
  optional binary byte;
}
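
And in case you are on PySpark, a hedged sketch of the same idea (the output
path is illustrative): a bytearray value maps to Spark's BinaryType, so it
ends up as a Parquet BYTE_ARRAY column as well.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One record with a raw byte payload; bytearray maps to BinaryType.
df = spark.createDataFrame([(bytearray(b"\x19\x25"),)], ["byte"])
df.printSchema()                    # byte: binary (nullable = true)
df.write.parquet("/tmp/bytes_py/")  # illustrative output path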

Hope this helps.

Cheers, Fokko


On Thu, Jan 16, 2020 at 09:34, Duan,Bing wrote:

> Hi all:
>
> I read binary data (protobuf format) from the filesystem with the binaryFiles
> function into an RDD[Array[Byte]] and it works fine. But when I save it to the
> filesystem with saveAsTextFile, the quotation marks are escaped like this:
> "\"20192_1\"",1,24,0,2,"\"S66.000x001\"", which should
> be "20192_1",1,24,0,2,"S66.000x001".
>
> Could anyone give me some tips on how to implement a function
> like saveAsBinaryFile to persist the RDD[Array[Byte]]?
>
> Bests!
>
> Bing
>


Re:

2019-12-27 Thread Driesprong, Fokko
Does anyone have an opinion on this? A link to the PR:
https://github.com/apache/spark/pull/26644

Cheers, Fokko


On Fri, Dec 20, 2019 at 16:00, Driesprong, Fokko wrote:

> Folks,
>
> I've opened a PR a while ago with a PR to merge the possibility to merge
> a custom data type, into a native data type. This is something new because
> of the introduction of Delta.
>
> To have some background, I'm having a DataSet that has fields of the type
> XMLGregorianCalendarType. I don't care about this type and would like to
> convert this to a standard data type. Mainly because, if I'm reading the
> data again using another job, it needs to have the customer data type being
> registered, which is not possible in the SQL API. The magic bit here is
> that I'm overriding the jsonValue to lose the information about the custom
> data type. In this case, you have to make sure that it is serialized as the
> normal timestamp.
>
> Before Delta, when appending to the table, everything would go fine
> because it would not check compatibility on write. Now with Delta, things
> are different. When writing, it will check if the two structures can be
> merged:
>
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m;
> support was removed in 8.0
> Warning: Ignoring non-spark config property:
> eventLog.rolloverIntervalSeconds=3600
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed
> to merge fields 'EventTimestamp' and 'EventTimestamp'. Failed to merge
> incompatible data types TimestampType and
> org.apache.spark.sql.types.CustomXMLGregorianCalendarType@6334178e;;
> at
> com.databricks.sql.transaction.tahoe.schema.SchemaUtils$$anonfun$18.apply(SchemaUtils.scala:685)
> at
> com.databricks.sql.transaction.tahoe.schema.SchemaUtils$$anonfun$18.apply(SchemaUtils.scala:674)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at
> com.databricks.sql.transaction.tahoe.schema.SchemaUtils$.com$databricks$sql$transaction$tahoe$schema$SchemaUtils$$merge$1(SchemaUtils.scala:674)
> at
> com.databricks.sql.transaction.tahoe.schema.SchemaUtils$.mergeSchemas(SchemaUtils.scala:750)
> at
> com.databricks.sql.transaction.tahoe.schema.ImplicitMetadataOperation$class.updateMetadata(ImplicitMetadataOperation.scala:63)
> at
> com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.updateMetadata(WriteIntoDelta.scala:50)
> at
> com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.write(WriteIntoDelta.scala:90)
> at
> com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand$$anonfun$run$2.apply(CreateDeltaTableCommand.scala:119)
> at
> com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand$$anonfun$run$2.apply(CreateDeltaTableCommand.scala:93)
> at
> com.databricks.logging.UsageLogging$$anonfun$recordOperation$1.apply(UsageLogging.scala:405)
> at
> com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at
> com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
> at
> com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:18)
> at
> com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:272)
> at
> com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:18)
> at
> com.databricks.logging.UsageLogging$class.recordOperation(UsageLogging.scala:386)
> at
> com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:18)
> at
> com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:55)
> at
> com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:98)
> at
> com.databricks.spark.util.UsageLogger$class.recordOperation(UsageLogger.scala:67)
> at
> com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:67)
> at
> com.databricks.spark.util.UsageLogging$class.recordOperation(UsageLogger.scala:342)
> at
> com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordOperation(CreateDeltaTableCommand.scala:45)
> at
> com.databricks.sql.transaction.tahoe.metering.DeltaLog

[no subject]

2019-12-20 Thread Driesprong, Fokko
Folks,

A while ago I opened a PR that adds the possibility of merging a custom data
type into a native data type. This is something new because of the
introduction of Delta.

To give some background: I have a Dataset that has fields of the type
XMLGregorianCalendarType. I don't care about this type and would like to
convert it to a standard data type, mainly because, if I'm reading the data
again using another job, it needs to have the custom data type registered,
which is not possible in the SQL API. The magic bit here is that I'm
overriding the jsonValue to lose the information about the custom data type.
In this case, you have to make sure that it is serialized as the normal
timestamp.

Before Delta, when appending to the table, everything would go fine because
it would not check compatibility on write. Now with Delta, things are
different. When writing, it will check if the two structures can be merged:

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support
was removed in 8.0
Warning: Ignoring non-spark config property:
eventLog.rolloverIntervalSeconds=3600
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed
to merge fields 'EventTimestamp' and 'EventTimestamp'. Failed to merge
incompatible data types TimestampType and
org.apache.spark.sql.types.CustomXMLGregorianCalendarType@6334178e;;
at com.databricks.sql.transaction.tahoe.schema.SchemaUtils$$anonfun$18.apply(SchemaUtils.scala:685)
at com.databricks.sql.transaction.tahoe.schema.SchemaUtils$$anonfun$18.apply(SchemaUtils.scala:674)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.databricks.sql.transaction.tahoe.schema.SchemaUtils$.com$databricks$sql$transaction$tahoe$schema$SchemaUtils$$merge$1(SchemaUtils.scala:674)
at com.databricks.sql.transaction.tahoe.schema.SchemaUtils$.mergeSchemas(SchemaUtils.scala:750)
at com.databricks.sql.transaction.tahoe.schema.ImplicitMetadataOperation$class.updateMetadata(ImplicitMetadataOperation.scala:63)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.updateMetadata(WriteIntoDelta.scala:50)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.write(WriteIntoDelta.scala:90)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand$$anonfun$run$2.apply(CreateDeltaTableCommand.scala:119)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand$$anonfun$run$2.apply(CreateDeltaTableCommand.scala:93)
at com.databricks.logging.UsageLogging$$anonfun$recordOperation$1.apply(UsageLogging.scala:405)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:18)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:272)
at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:18)
at com.databricks.logging.UsageLogging$class.recordOperation(UsageLogging.scala:386)
at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:18)
at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:55)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:98)
at com.databricks.spark.util.UsageLogger$class.recordOperation(UsageLogger.scala:67)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:67)
at com.databricks.spark.util.UsageLogging$class.recordOperation(UsageLogger.scala:342)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordOperation(CreateDeltaTableCommand.scala:45)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging$class.recordDeltaOperation(DeltaLogging.scala:108)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordDeltaOperation(CreateDeltaTableCommand.scala:45)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.run(CreateDeltaTableCommand.scala:93)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at

Re: [DISCUSS] PostgreSQL dialect

2019-12-01 Thread Driesprong, Fokko
+1 (non-binding)

Cheers, Fokko

Op do 28 nov. 2019 om 03:47 schreef Dongjoon Hyun :

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
> wrote:
>
>> Yea, +1, that looks pretty reasonable to me.
>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>> from the codebase before it's too late. Currently we only have 3 features
>> under the PostgreSQL dialect:
>> I personally think we could at least stop work on the dialect until
>> 3.0 is released.
>>
>>
>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> +1 with the practical proposal.
>>> To me, the major concern is that the code base becomes complicated,
>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>> one big flag `spark.sql.dialect` and isolating related code in #25697
>>> , but it seems hard to keep
>>> clean.
>>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>>> mode, which can be confusing sometimes.
>>>
>>> Gengliang
>>>
>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:
>>>
 +1


> One particular negative effect has been that new postgresql tests add
> well over an hour to tests,


 Adding postgresql tests is for improving the test coverage of Spark
 SQL. We should continue to do this by importing more test cases. The
 quality of Spark highly depends on the test coverage. We can further
 parallelize the test execution to reduce the test time.

 Migrating PostgreSQL workloads to Spark SQL


 This should not be our current focus. In the near future, it is
 impossible to be fully compatible with PostgreSQL. We should focus on
 adding features that are useful to the Spark community. PostgreSQL is a good
 reference, but we do not need to blindly follow it. We already closed
 multiple related JIRAs that try to add some PostgreSQL features that are
 not commonly used.

 Cheers,

 Xiao


 On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
 mszymkiew...@gmail.com> wrote:

> I think it is important to distinguish between two different concepts:
>
>    - Adherence to standards and their well-established implementations.
>    - Enabling migrations from some product X to Spark.
>
> While these two problems are related, they are independent and one
> can be achieved without the other.
>
>    - The former approach doesn't imply that all features of the SQL
>    standard (or of a specific implementation) are provided. It is sufficient
>    that the commonly used features that are implemented are standard compliant.
>    Therefore, if an end user applies some well-known pattern, things will work
>    as expected.
>
>    In my personal opinion that's something that is worth the required
>    development resources, and in general should happen within the project.
>
>
>    - The latter one is more complicated. First of all, the premise
>    that one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>    While both Spark and PostgreSQL evolve, and probably have more in common
>    today than a few years ago, they're not even close enough to pretend that
>    one can be a replacement for the other. In contrast, existing compatibility
>    layers between major vendors make sense, because feature disparity (at
>    least when it comes to core functionality) is usually minimal. And that
>    doesn't even touch the problem that PostgreSQL provides extensively used
>    extension points that enable a broad and evolving ecosystem (what should we
>    do about continuous queries? Should Structured Streaming provide some
>    compatibility layer as well?).
>
>    More realistically, Spark could provide a compatibility layer with
>    some analytical tools that themselves provide some PostgreSQL compatibility,
>    but these are not always fully compatible with upstream PostgreSQL, nor do
>    they necessarily follow the latest PostgreSQL development.
>
>    Furthermore, a compatibility layer can, within certain limits
>    (i.e. availability of required primitives), be maintained as a separate
>    project, without putting more strain on existing resources. Effectively
>    what we care about here is whether we can translate certain SQL strings into
>    logical or physical plans.
>
>
> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark
> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This is going very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several

Re: override collect_list

2019-12-01 Thread Driesprong, Fokko
Hi Abhnav,

This sounds to me like a bad design, since it isn't scalable. Would it be
possible to store all the data in a database like HBase/Bigtable/Cassandra?
This would allow you to write the data from all the workers in parallel to
the database.
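
As a rough sketch (using the df and column names from your example; the output
path is made up), the aggregation and one list-free alternative look like this:

import org.apache.spark.sql.functions.{collect_list, struct}

// The aggregation from the question; every group is materialized in memory,
// so a heavily skewed id can blow up a single task:
val aggregated = df.groupBy("id")
  .agg(collect_list(struct("col1", "col2")).as("col3"))

// Alternative: keep the rows flat and let the sink do the grouping, e.g. by
// writing the data partitioned by id (hypothetical output path):
df.write.partitionBy("id").parquet("/data/rows_by_id")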

Cheers, Fokko

Op wo 27 nov. 2019 om 06:58 schreef Ranjan, Abhinav <
abhinav.ranjan...@gmail.com>:

> Hi all,
>
> I want to collect some rows in a list by using Spark's collect_list
> function.
>
> However, the number of rows going into the list is overflowing the memory. Is
> there any way to force the collection of rows onto the disk rather than into
> memory, or else, instead of collecting it as a single list, collect it as a
> list of lists, so as to avoid collecting the whole thing into memory?
>
> ex: df as:
>
> id  col1  col2
> 1   as    sd
> 1   df    fg
> 1   gh    jk
> 2   rt    ty
>
> df.groupBy("id").agg(collect_list(struct("col1", "col2")).as("col3"))
>
> id  col3
> 1   [(as,sd),(df,fg),(gh,jk)]
> 2   [(rt,ty)]
>
>
> so if id=1 has too many rows, the list will overflow. How can I
> avoid this scenario?
>
>
> Thanks,
>
> Abhnav
>
>
>


Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Driesprong, Fokko
Michael Heuer, that's an interesting issue.

1.8.2 to 1.9.0 is almost binary compatible (94%):
http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html.
Most of the changes are about removing the Jackson and Netty APIs from Avro's
public API and deprecating the Joda library. I would strongly advise moving to
1.9.1, since 1.9.0 has some regressions; for Java the most important is:
https://jira.apache.org/jira/browse/AVRO-2400
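
If you want to pin your own build to 1.9.1 while testing, a minimal sketch for
an sbt build (assuming sbt; Maven's dependencyManagement can do the same):

// build.sbt - force Avro 1.9.1 even if transitive dependencies pull in 1.8.x/1.9.0
dependencyOverrides += "org.apache.avro" % "avro" % "1.9.1"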

I'd love to dive into the issue that you describe and I'm curious if the
issue is still there with Avro 1.9.1. I'm a bit busy at the moment but
might have some time this weekend to dive into it.

Cheers, Fokko Driesprong


Op vr 13 sep. 2019 om 02:32 schreef Reynold Xin :

> +1! Long due for a preview release.
>
>
> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau 
> wrote:
>
>> I like the idea from the PoV of giving folks something to start testing
>> against and exploring so they can raise issues with us earlier in the
>> process and we have more time to make calls around this.
>>
>> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge  wrote:
>>
>> +1  Like the idea as a user and a DSv2 contributor.
>>
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim  wrote:
>>
>> +1 (as a contributor) from me to have a preview release of Spark 3, as it
>> would help to test the features. When to cut the preview release is
>> debatable, as the major work should ideally be done before that - if we
>> intend to introduce new features before the official release, that should
>> work regardless of this, but if we intend to have the opportunity to test
>> earlier, then ideally that work should land before the preview is cut.
>>
>> As one of the contributors in the structured streaming area, I'd like to add
>> some items for Spark 3.0, both "must be done" and "better to have". For
>> "better to have", I picked some new-feature items which committers
>> reviewed for a couple of rounds and then dropped without a soft reject (no
>> valid reason to stop). For Spark 2.4 users, the only added feature for
>> structured streaming is Kafka delegation token (given we count revising the
>> Kafka consumer pool as an improvement). I hope we provide some gifts for
>> structured streaming users in the Spark 3.0 envelope.
>>
>> > must be done
>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>> output
>> It's a correctness issue reported by multiple users, first reported in
>> Nov. 2018. There is a way to reproduce it consistently, and we have a patch,
>> submitted in Jan. 2019, to fix it.
>>
>> > better to have
>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>> start and end offset
>> * SPARK-20568 Delete files after processing in structured streaming
>>
>> There are some more new feature/improvement items in SS, but given we're
>> talking about ramping down, the above list might be a realistic one.
>>
>>
>>
>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin  wrote:
>>
>> As a user/non committer, +1
>>
>> I love the idea of an early 3.0.0 so we can test current dev against it.
>> I know the final 3.x will probably need another round of testing when it
>> gets out, but less for sure... I know I could check out and compile, but
>> having a "packaged" pre-version is great if it does not take too much of
>> the team's time...
>>
>> jg
>>
>>
>> On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
>>
>> +1 from me too but I would like to know what other people think too.
>>
>> 2019년 9월 12일 (목) 오전 9:07, Dongjoon Hyun 님이 작성:
>>
>> Thank you, Sean.
>>
>> I'm also +1 for the following three.
>>
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>>
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>> it a lot.
>>
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>>
>> - https://spark.apache.org/versioning-policy.html
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer  wrote:
>>
>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>> problems resolved, e.g.
>>
>> https://issues.apache.org/jira/browse/SPARK-25588
>> https://issues.apache.org/jira/browse/SPARK-27781
>>
>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far
>> as I know, Parquet has not cut a release based on this new version.
>>
>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>
>> https://github.com/apache/spark/pull/24851
>> https://github.com/apache/spark/pull/24297
>>
>>michael
>>
>>
>> On Sep 11, 2019, at 1:37 PM, Sean Owen  wrote:
>>
>> I'm curious what current feelings are about ramping down towards a
>> Spark 3 release. It feels close to ready. There is no fixed date,
>> though in the past we had informally tossed around "back end of 2019".
>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>> Spark 2 to last longer, so to 

Re: Welcoming some new committers and PMC members

2019-09-10 Thread Driesprong, Fokko
Congrats all, well deserved!


Cheers, Fokko

Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi :

> Congrats Guys!
>
> G
>
>
> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers and one PMC
>> member. Join me in welcoming them to their new roles!
>>
>> New PMC member: Dongjoon Hyun
>>
>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
>> Weichen Xu, Ruifeng Zheng
>>
>> The new committers cover lots of important areas including ML, SQL, and
>> data sources, so it’s great to have them here. All the best,
>>
>> Matei and the Spark PMC
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-15 Thread Driesprong, Fokko
Sorry for the late reply, was a bit busy lately, but I still would like to
share my thoughts on this.

For Apache Airflow we're dropping support for Python 2 in the next major
release. We're now supporting Python 3.5+, mostly because:

   - It is easier to maintain and test, with fewer if/else constructions for the
   different Python versions. Also, not having to test against Python 2.x
   reduces the build matrix.
   - Python 3 has support for typing. From Python 3.5 you can include
   provisional type hints. An excellent presentation by Guido himself:
   https://www.youtube.com/watch?v=2wDvzy6Hgxg. From Python 3.5 it is still
   provisional, but it is a really good idea. At Airflow we've noticed that
   using mypy catches bugs early:
  - This puts less stress on the (boring part of the) reviewing
  process, since a lot of this stuff is checked automatically.
  - For new developers, it is easier to read the code because of the
  annotations.
  - The hints can be used as input for generated documentation (or to check
  whether it is still in sync with the docstrings).
  - It is easier to extend the code since you know what kinds of types to
  expect, and your IDE will also pick up the hinting.
   - Python 2.x will be EOL at the end of this year.

I have a strong preference to migrate everything to Python 3.

Cheers, Fokko


Op wo 7 aug. 2019 om 12:14 schreef Weichen Xu :

> All right, we could support both Python 2 and Python 3 for Spark 3.0.
>
> On Wed, Aug 7, 2019 at 6:10 PM Hyukjin Kwon  wrote:
>
>> We didn't drop Python 2 yet, although it's deprecated. So I think it
>> should support both Python 2 and Python 3 at the current status.
>>
>> 2019년 8월 7일 (수) 오후 6:54, Weichen Xu 님이 작성:
>>
>>> Hi all,
>>>
>>> I would like to discuss the compatibility of the dev scripts. Because we
>>> already decided to deprecate Python 2 in Spark 3.0, for the development
>>> scripts under dev/ we have two choices:
>>> 1) Migrate from Python 2 to Python 3
>>> 2) Support both Python 2 and Python 3
>>>
>>> I tend towards option (2), which is more maintenance-friendly.
>>>
>>> Regards,
>>> Weichen
>>>
>>


Re: Jackson version updation

2019-06-28 Thread Driesprong, Fokko
The PR bumping Jackson to 2.9.6 gives some examples of the behavioral
changes that Sean is referring to:
https://github.com/apache/spark/pull/21596

Cheers,
Fokko Driesprong

Op vr 28 jun. 2019 om 14:13 schreef Sean Owen :

> https://github.com/apache/spark/blob/branch-2.4/pom.xml#L161
> Correct, because it would introduce behavior changes.
>
> On Fri, Jun 28, 2019 at 3:54 AM Pavithra R  wrote:
>
>> In the Spark master branch, the version of the Jackson jars has been
>> upgraded to 2.9.9
>>
>>
>> https://github.com/apache/spark/commit/bd8732300385ad99d2cec3a4af49953d8925eaf6
>>
>>
>>
>> *[SPARK-27757][CORE] Bump Jackson to 2.9.9 – *
>>
>>
>>
>> This has been done to address CVE-2019-12086.
>>
>>
>>
>> Could you confirm why Jackson jars are not upgraded in older branches
>> like 2.3 etc?
>>
>>
>>
>> Thanks,
>>
>> Pavithra R
>>
>>
>>
>> Huawei Technologies India Pvt. Ltd.
>>
>> Survey No. 37, Next to EPIP Area, Kundalahalli, Whitefield
>>
>> Bengaluru-560066, Karnataka
>>
>> Tel: + 91-80-49160700 Ext 72060II Mob: 9790706742 Email:
>> pavithr...@huawei.com
>>
>>
>>
>


Re: Spark 2.4.2

2019-04-19 Thread Driesprong, Fokko
For me a +1 on upgrading Jackson as well. This has been long overdue. There
are some behavioural changes regarding handling null/None. This is also
described in the PR:
https://github.com/apache/spark/pull/21596

It also has a positive impact on performance.

Cheers, Fokko

Op vr 19 apr. 2019 om 19:16 schreef Arun Mahadevan 

> +1 to upgrade Jackson. It has come up multiple times due to CVEs and the
> back port has worked out but it may be good to include if its not going to
> delay the release.
>
> On Thu, 18 Apr 2019 at 19:53, Wenchen Fan  wrote:
>
>> I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut
>> RC2 shortly.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung 
>> wrote:
>>
>>> Re shading - same argument I’ve made earlier today in a PR...
>>>
>>> (Context: in many cases Spark has light or indirect dependencies, but
>>> bringing them into the process breaks users' code easily)
>>>
>>>
>>> --
>>> *From:* Michael Heuer 
>>> *Sent:* Thursday, April 18, 2019 6:41 AM
>>> *To:* Reynold Xin
>>> *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen
>>> Fan; Xiao Li
>>> *Subject:* Re: Spark 2.4.2
>>>
>>> +100
>>>
>>>
>>> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
>>>
>>> We should have shaded all Spark’s dependencies :(
>>>
>>> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:
>>>
 For users that would inherit Jackson and use it directly, or whose
 dependencies do. Spark itself (with modifications) should be OK with
 the change.
 It's risky and we normally wouldn't backport it, except that I've heard a
 few times about concerns about CVEs affecting databind, so I'm wondering
 who else out there might have an opinion. I'm not pushing for it
 necessarily.

 On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin 
 wrote:
 >
 > For Jackson - are you worrying about JSON parsing for users or
 internal Spark functionality breaking?
 >
 > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
 >>
 >> There's only one other item on my radar, which is considering
 updating
 >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come
 up
 >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
 >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
 >> behavior non-trivially. That said back-porting the update PR to 2.4
 >> worked out OK locally. Any strong opinions on this one?
 >>
 >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
 wrote:
 >> >
 >> > I volunteer to be the release manager for 2.4.2, as I was also
 going to propose 2.4.2 because of the reverting of SPARK-25250. Is there
 any other ongoing bug fixes we want to include in 2.4.2? If no I'd like to
 start the release process today (CST).
 >> >
 >> > Thanks,
 >> > Wenchen
 >> >
 >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen 
 wrote:
 >> >>
 >> >> I think the 'only backport bug fixes to branches' principle
 remains sound. But what's a bug fix? Something that changes behavior to
 match what is explicitly supposed to happen, or implicitly supposed to
 happen -- implied by what other similar things do, by reasonable user
 expectations, or simply how it worked previously.
 >> >>
 >> >> Is this a bug fix? I guess the criteria that matches is that
 behavior doesn't match reasonable user expectations? I don't know enough to
 have a strong opinion. I also don't think there is currently an objection
 to backporting it, whatever it's called.
 >> >>
 >> >>
 >> >> Is the question whether this needs a new release? There's no harm
 in another point release, other than needing a volunteer release manager.
 One could say, wait a bit longer to see what more info comes in about
 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move
 towards a release cycle again. I don't see objection to that either (?)
 >> >>
 >> >>
 >> >> The meta question remains: is a 'bug fix' definition even agreed,
 and being consistently applied? There aren't correct answers, only best
 guesses from each person's own experience, judgment and priorities. These
 can differ even when applied in good faith.
 >> >>
 >> >> Sometimes the variance of opinion comes because people have
 different info that needs to be surfaced. Here, maybe it's best to share
 what about that offline conversation was convincing, for example.
 >> >>
 >> >> I'd say it's also important to separate what one would prefer
 from what one can't live with(out). Assuming one trusts the intent and
 experience of the handful of others with an opinion, I'd defer to someone
 who wants X and will own it, even if I'm moderately against it. Otherwise
 we'd get little done.
 >> >>
 >> >> In that light, it seems like both 

Re: [Events] Events not fired for SaveAsTextFile (?)

2018-10-15 Thread Driesprong, Fokko
Hi Bolke,

I would argue that Spark is not the right level of abstraction for doing
this. I would create a wrapper around the particular filesystem:
http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html
You can write a wrapper around the LocalFileSystem if data will be written to
local disk, the DistributedFileSystem when it is written to HDFS, and many
object stores implement this interface as well. My 2¢
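
A rough sketch of the idea (assuming HDFS; the lineage sink and class names are
made up, and writes such as create/rename would be wrapped the same way):

import org.apache.hadoop.fs.{FSDataInputStream, FilterFileSystem, Path}

// Made-up lineage sink, only here to keep the sketch self-contained.
object LineageRecorder {
  def recordRead(path: String): Unit = println(s"lineage read: $path")
}

// Hypothetical wrapper: delegates everything to the wrapped FileSystem and
// records which paths are opened for reading. Registering it for a scheme
// (e.g. via fs.hdfs.impl) makes Spark pick it up for hdfs:// paths.
class LineageFileSystem extends FilterFileSystem {
  override def open(f: Path, bufferSize: Int): FSDataInputStream = {
    LineageRecorder.recordRead(f.toString)
    super.open(f, bufferSize)
  }
}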

Cheers, Fokko

Op ma 15 okt. 2018 om 18:58 schreef Bolke de Bruin :

> Hi,
>
> Apologies upfront if this should have gone to user@ but it seems a
> developer question so here goes.
>
> We are trying to improve a listener to track lineage across our platform.
> This requires tracking where data comes from and where it goes to. E.g.
>
> sc.setLogLevel("INFO");
> val data = sc.textFile("hdfs://migration/staffingsec/Mydata.gz")
> data.saveAsTextFile ("hdfs://datalab/user/xxx”);
>
> In this case we would like to know that Spark picked up “Mydata.gz” and
> wrote it to “xxx”. Of course more complex examples are possible.
>
> In the particular case above, Spark (2.3.2) does not seem to trigger
> any events, or at least none that we know of that give us the relevant
> information.
>
> Is that a correct assessment? What can we do to get that information
> without knowing the code upfront? Should we provide a patch?
>
> Thanks
> Bolke
>
> Verstuurd vanaf mijn iPad
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: welcome a new batch of committers

2018-10-04 Thread Driesprong, Fokko
Congratulations all!

Op wo 3 okt. 2018 om 23:03 schreef Bryan Cutler :

> Congratulations everyone! Very well deserved!!
>
> On Wed, Oct 3, 2018, 1:59 AM Reynold Xin  wrote:
>
>> Hi all,
>>
>> The Apache Spark PMC has recently voted to add several new committers to
>> the project, for their contributions:
>>
>> - Shane Knapp (contributor to infra)
>> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
>> - Kazuaki Ishizaki (contributor to Spark SQL)
>> - Xingbo Jiang (contributor to Spark Core and SQL)
>> - Yinan Li (contributor to Spark on Kubernetes)
>> - Takeshi Yamamuro (contributor to Spark SQL)
>>
>> Please join me in welcoming them!
>>
>>


Re: Spark data quality bug when reading parquet files from hive metastore

2018-08-24 Thread Driesprong, Fokko
Hi Andrew,

This blog gives an idea of how the schema is resolved:
https://blog.godatadriven.com/multiformat-spark-partition There is some
optimisation going on when reading Parquet using Spark. Hope this helps.
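
For reference, two knobs that influence that resolution (a sketch only; whether
they apply depends on your exact case, and the table path is made up):

// Fall back to the Hive SerDe instead of Spark's native Parquet reader:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Or ask the native reader to merge schemas across the Parquet files:
val df = spark.read.option("mergeSchema", "true").parquet("/warehouse/my_table")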

Cheers, Fokko


Op wo 22 aug. 2018 om 23:59 schreef t4 :

> https://issues.apache.org/jira/browse/SPARK-23576 ?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: New to dev community | Contribution to Mlib

2017-09-22 Thread Driesprong, Fokko
Hi Venna,

Sounds like a very interesting algorithm. I have to agree with Seth: in the
end you don't want to add a lot of algorithms to Spark itself; it will blow
up the codebase, and the tests will run forever. You can also consider
publishing it to the Spark Packages website. I've also published an outlier
detection algorithm over there:
https://spark-packages.org/package/Fokko/spark-stochastic-outlier-selection

Cheers, Fokko

2017-09-22 2:10 GMT+02:00 Venali Sonone :

> Thank you for your response.
>
> The algorithm that I am proposing is Isolation Forest.
> Link to paper: paper. I
> particularly find that it should be included in Spark ML because so many
> applications that use Spark as part of a real-time streaming engine in
> industry need anomaly detection, and the current Spark ML supports it in some
> way by means of clustering. I will probably start to create the implementation
> and prepare a proposal as you suggested.
>
> It is interesting to know that Spark is still implementing stuff in Spark
> ML to reach full parity with MLlib. Can I please get connected to the folks
> working on it, as I am interested in contributing? I have been a heavy user
> of Spark since summer '15.
>
>  Cheers!
> -Venali
>
> On Thu, Sep 21, 2017 at 1:33 AM, Seth Hendrickson <
> seth.hendrickso...@gmail.com> wrote:
>
>> I'm not exactly clear on what you're proposing, but this sounds like
>> something that would live as a Spark package - a framework for anomaly
>> detection built on Spark. If there is some specific algorithm you have in
>> mind, it would be good to propose it on JIRA and discuss why you think it
>> needs to be included in Spark and not live as a Spark package.
>>
>> In general, there will probably be resistance to including new algorithms
>> in Spark ML, especially until the ML package has reached full parity with
>> MLlib. Still, if you can provide more details that will help to understand
>> what is best here.
>>
>> On Thu, Sep 14, 2017 at 1:29 AM, Venali Sonone 
>> wrote:
>>
>>>
>>> Hello,
>>>
>>> I am new to the dev community of Spark and also to open source in general,
>>> but I have used Spark extensively.
>>> I want to create a complete part on anomaly detection in Spark MLlib.
>>> For this, I want to know if someone could guide me so I can start the
>>> development and contribute to Spark MLlib.
>>>
>>> Sorry for sounding naive if I do, but any help is appreciated.
>>>
>>> Cheers!
>>> -venna
>>>
>>>
>>
>


Re: Scala 2.11 default build

2016-01-30 Thread Driesprong, Fokko
Nice, good work!

I've been using a Docker container to compile against 2.11:
https://github.com/fokko/docker-spark

Cheers, Fokko


2016-01-30 9:22 GMT+01:00 Reynold Xin :

> FYI - I just merged Josh's pull request to switch to Scala 2.11 as the
> default build.
>
> https://github.com/apache/spark/pull/10608
>
>


Re: How Spark utilize low-level architecture features?

2016-01-21 Thread Driesprong, Fokko
Hi Boric,

The Spark MLlib package is built on top of Breeze, which in turn uses
netlib-java. The netlib-java library can be optimized for each system by
compiling for the specific architecture:

*To get optimal performance for a specific machine, it is best to compile
locally by grabbing the latest ATLAS or the latest OpenBLAS and following
the compilation instructions.*
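
A quick way to check which backend netlib-java actually loaded at runtime
(a small sketch; it prints either a native implementation or the pure-Java
F2jBLAS fallback):

// Ask netlib-java which BLAS implementation it picked up on this machine.
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)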

For the rest, Spark focuses on adding more machines instead of using very
specific optimization procedures. Also, optimizing your jobs (decreasing
communication between workers, etc.) might do the trick.

Cheers, Fokko.

2016-01-21 6:55 GMT+01:00 Boric Tan :

> Anyone could shed some light on this?
>
> Thanks,
> Boric
>
> On Tue, Jan 19, 2016 at 4:12 PM, Boric Tan 
> wrote:
>
>> Hi there,
>>
>> I am new to Spark, and would like to get some help to understand if Spark
>> can utilize the underlying architectures for better performance. If so, how
>> does it do it?
>>
>> For example, assume there is a cluster built with machines of different
>> CPUs, will Spark check the individual CPU information and use some
>> machine-specific setting for the tasks assigned to that machine? Or is it
>> totally dependent on the underlying JVM implementation to run the JAR file,
>> and is the JVM therefore the place to check if certain CPU features can be
>> used?
>>
>> Thanks,
>> Boric
>>
>
>


Optimized toIndexedRowMatrix

2016-01-20 Thread Driesprong, Fokko
Hi guys,

I've been working on an optimized implementation of the toIndexedRowMatrix
method of BlockMatrix. I already created a ticket and submitted a pull
request on GitHub. What has to be done to get this accepted? All the tests
are passing.

On my own GitHub I created a project to see how the performance is affected;
for dense matrices this is a speedup of almost 19 times. For sparse matrices
it will most likely also be more optimal, as the current implementation
requires a lot of shuffling and creates high volumes of intermediate objects
(unless it is super sparse, but then a BlockMatrix would not be very optimal
either).
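
For reference, the conversion being optimized is the last step in a snippet
like this (a small sketch, assuming an existing SparkContext `sc`):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Build a small row matrix, block it, and convert back to row form;
// BlockMatrix.toIndexedRowMatrix is the code path the PR optimizes.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
  IndexedRow(1L, Vectors.dense(4.0, 5.0, 6.0))
))
val blockMatrix = new IndexedRowMatrix(rows).toBlockMatrix()
val indexedRowMatrix = blockMatrix.toIndexedRowMatrix()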

I would appreciate suggestions or tips to get this accepted.

Cheers, Fokko Driesprong.