Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Haejoon Lee
+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Haejoon Lee
+1

On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
> References:
>
>- JIRA ticket 
>- SPIP doc
>
> 
>- Discussion thread
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Gengliang Wang
>


Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
Welcome Ram! :-)

I would recommend checking
https://issues.apache.org/jira/browse/SPARK-37935 as a starter task.

Refer to https://github.com/apache/spark/pull/41504 and
https://github.com/apache/spark/pull/41455 as example PRs.

You can also add a new sub-task if you find any error messages that need
improvement.

Thanks!

On Mon, Sep 18, 2023 at 9:33 AM Denny Lee  wrote:

> Hi Ram,
>
> We have some good guidance at
> https://spark.apache.org/contributing.html
>
> HTH!
> Denny
>
>
> On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:
>
>>
>>
>>
>> Hello All,
>> I recently joined this community and would like to contribute. Is there a
>> guideline or recommendation on tasks that can be picked up by a first-timer,
>> or a starter task?
>>
>> I tried looking at the Stack Overflow tag apache-spark, but couldn't find
>> any information for first-time contributors.
>>
>> Looking forward to learning and contributing.
>>
>> Thanks
>> Ram
>>
>


Re: LLM script for error message improvement

2023-08-03 Thread Haejoon Lee
Additional information:

Please check https://issues.apache.org/jira/browse/SPARK-37935 if you want
to start contributing to improving error messages.

You can create sub-tasks if you believe there are error messages that need
improvement, in addition to the tasks listed in the umbrella JIRA.

You can also refer to https://github.com/apache/spark/pull/41504 and
https://github.com/apache/spark/pull/41455 as example PRs.

On Thu, Aug 3, 2023 at 1:10 PM Ruifeng Zheng  wrote:

> +1 from my side, I'm fine to have it as a helper script
>
> On Thu, Aug 3, 2023 at 10:53 AM Hyukjin Kwon  wrote:
>
>> I think adding that dev tool script to improve the error message is fine.
>>
>> On Thu, 3 Aug 2023 at 10:24, Haejoon Lee  wrote:
>>
>>> Dear contributors, I hope you are doing well!
>>>
>>> I see there are contributors who are interested in working on error
>>> message improvements and contributing persistently, so I want to share an
>>> LLM-based error message improvement script to help with your contributions.
>>>
>>> You can find details about the script at
>>> https://github.com/apache/spark/pull/41711. I believe this can help
>>> your error message improvement work, so I encourage you to take a look at
>>> the pull request and leverage the script.
>>>
>>> Please let me know if you have any questions or concerns.
>>>
>>> Thanks all for your time and contributions!
>>>
>>> Best regards,
>>>
>>> Haejoon
>>>
>>


LLM script for error message improvement

2023-08-02 Thread Haejoon Lee
Dear contributors, I hope you are doing well!

I see there are contributors who are interested in working on error message
improvements and contributing persistently, so I want to share an LLM-based
error message improvement script to help with your contributions.

You can find details about the script at
https://github.com/apache/spark/pull/41711. I believe this can help your
error message improvement work, so I encourage you to take a look at the
pull request and leverage the script.

Please let me know if you have any questions or concerns.

Thanks all for your time and contributions!

Best regards,

Haejoon


Re: [Question] Can't start Spark Connect

2023-03-08 Thread Haejoon Lee
Additionally, try deleting the `.idea` directory in the Spark home directory
and restarting IntelliJ if it does not work properly after re-building during
development.
The `.idea` directory stores IntelliJ's project configuration and settings,
and is automatically regenerated when IntelliJ is launched.



Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Haejoon Lee
Congrats, Xinrong!!

On Tue, Aug 9, 2022 at 5:12 PM Hyukjin Kwon  wrote:

> Hi all,
>
> The Spark PMC recently added Xinrong Meng as a committer on the project.
> Xinrong is a major contributor to PySpark, especially the Pandas API on Spark.
> She has guided a lot of new contributors enthusiastically. Please join me
> in welcoming Xinrong!
>
>


Question about using multiple partitions for Window cumulative functions when a partition is not specified.

2021-08-29 Thread Haejoon Lee
Hi all,

I noticed that Spark uses only one partition when performing Window
cumulative functions without specifying a partition, so the entire dataset
is moved into a single partition, which easily causes OOM or serious
performance degradation.

See the example below:

>>> from pyspark.sql import functions as F, Window
>>> sdf = spark.range(10)
>>> sdf.select(F.sum(sdf["id"]).over(Window.rowsBetween(Window.unboundedPreceding,
...     Window.currentRow))).show()
...
WARN WindowExec: No Partition Defined for Window operation! Moving all
data to a single partition, this can cause serious performance
degradation.
...
+---------------------------------------------------------------+
|sum(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
+---------------------------------------------------------------+
|                                                              0|
|                                                              1|
|                                                              3|
|                                                              6|
|                                                             10|
|                                                             15|
|                                                             21|
|                                                             28|
|                                                             36|
|                                                             45|
+---------------------------------------------------------------+

As shown in the example, a window cumulative function requires the result of
the previous row's computation as input for the next one. In Spark, it is
computed by simply moving all the data to one partition when no partition is
specified.
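
For comparison, if a suitable partitioning column exists, specifying it keeps
the computation distributed and avoids the warning. A minimal sketch (the
"key" column is hypothetical, added only for illustration):

>>> from pyspark.sql import functions as F, Window
>>> # hypothetical grouping column so the data stays spread across partitions
>>> sdf = spark.range(10).withColumn("key", F.col("id") % 2)
>>> w = Window.partitionBy("key").rowsBetween(Window.unboundedPreceding, Window.currentRow)
>>> # running sum per key; no "Moving all data to a single partition" warning
>>> sdf.select("key", F.sum("id").over(w).alias("running_sum")).show()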

To overcome this, Dask, for example, introduces the concept of Overlapping
Computations, which copies the dataset into multiple overlapping blocks and
performs the cumulative function sequentially over them when the dataset
exceeds the memory size.
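
As a rough illustration of computing a cumulative function block by block
without gathering everything into one place, here is a sketch using Dask
arrays (not Spark; the chunk size is arbitrary and the internal mechanism
differs, but the idea of working over blocks is the same):

>>> import dask.array as da
>>> x = da.arange(10, chunks=3)  # 10 elements split into blocks of at most 3
>>> x.cumsum().compute()         # cumulative sum computed over the blocks
array([ 0,  1,  3,  6, 10, 15, 21, 28, 36, 45])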

Of course, this method incurs extra cost for creating the copies and for
communication between blocks, but it allows cumulative functions to be
performed even when the size of the dataset exceeds the size of the memory,
rather than causing an OOM.

So it is simply a way to resolve the out-of-memory issue, without any
performance advantage, though.

I think this kind of use case is pretty common in data science, but I wonder
how frequent it is in Apache Spark.

Would it be helpful to implement this approach in Apache Spark for Window
cumulative functions on out-of-memory data when no partition is specified?

Check here, where the issue was first raised, for more detail.


Best,

Haejoon.