[jira] [Updated] (SPARK-44564) Refine the documents with LLM

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44564:
--
Description: 
Let's first focus on the documentation of the *PySpark DataFrame APIs*.

*1*, Choose a subset of DataFrame APIs.
Since review bandwidth is limited, we recommend that each PR cover at least 5 
APIs.

*2*, For each API, copy-paste the function (including its signature and 
docstring) into an LLM, and ask it to refine the documentation with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more 
examples
* please provide more examples for function 'unionByName'
* ...

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former 
generates better results.

*3*, Note that the LLM is not 100% reliable; the generated docstring may 
contain mistakes, e.g.
* the example code cannot run
* the example results are incorrect
* the example code doesn't reflect the example title
* the description uses the wrong version, or adds a 'Raises' section for a 
non-existent exception
* ...

We need to fix these before sending a PR.

We can try different prompts, pick the good parts, and combine them into the 
new docstring.
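The prompt variants above can be produced mechanically for each API under review. A minimal sketch (the helper name and template list are illustrative, not part of any existing tooling in this ticket):

```python
# Illustrative helper for generating the refinement prompts listed above;
# PROMPT_TEMPLATES and build_prompts are hypothetical names, not an existing tool.
PROMPT_TEMPLATES = [
    "please improve the docstring of the '{fn}' function",
    "please refine the comments of the '{fn}' function",
    "please refine the documents of the '{fn}' function, and add more examples",
    "please provide more examples for function '{fn}'",
]

def build_prompts(fn_name: str) -> list[str]:
    """Fill each template with the target API name."""
    return [t.format(fn=fn_name) for t in PROMPT_TEMPLATES]

for prompt in build_prompts("unionByName"):
    print(prompt)
```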
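One way to catch broken generated examples before sending a PR is to run them through Python's `doctest` module. A minimal sketch with a toy stand-in function (real `unionByName` examples would need a SparkSession, but the same check applies; the function below is hypothetical, for illustration only):

```python
import doctest

# Toy stand-in for an LLM-refined PySpark docstring; the function and its
# examples are illustrative, not actual Spark API documentation.
def union_by_name_demo(left, right):
    """Concatenate two lists of row-dicts, aligning on key names.

    Examples
    --------
    >>> left = [{"id": 1, "name": "a"}]
    >>> right = [{"name": "b", "id": 2}]
    >>> [sorted(row.items()) for row in union_by_name_demo(left, right)]
    [[('id', 1), ('name', 'a')], [('id', 2), ('name', 'b')]]
    """
    return left + right

# Execute every example in the docstring; a nonzero failure count means the
# generated examples either don't run or print something other than claimed.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(union_by_name_demo, name="union_by_name_demo",
                        globs={"union_by_name_demo": union_by_name_demo}):
    runner.run(test)
print(runner.failures)  # 0
```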



> Refine the documents with LLM
> -
>
> Key: SPARK-44564
> URL: https://issues.apache.org/jira/browse/SPARK-44564
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org








[jira] [Resolved] (SPARK-44557) Flaky PIP packaging test

2023-07-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44557.
--
Fix Version/s: 3.5.0, 4.0.0, 3.4.2
Resolution: Fixed

Issue resolved by pull request 42159
[https://github.com/apache/spark/pull/42159]

> Flaky PIP packaging test
> 
>
> Key: SPARK-44557
> URL: https://issues.apache.org/jira/browse/SPARK-44557
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0, 4.0.0, 3.4.2
>
>
> e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397






[jira] [Assigned] (SPARK-44557) Flaky PIP packaging test

2023-07-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44557:


Assignee: Hyukjin Kwon

> Flaky PIP packaging test
> 
>
> Key: SPARK-44557
> URL: https://issues.apache.org/jira/browse/SPARK-44557
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397







[jira] [Updated] (SPARK-44565) Example: Refine the docs for Union, UnionAll and unionByName

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44565:
--
Summary: Example: Refine the docs for Union, UnionAll and unionByName  
(was: Refine the docs for Union, UnionAll and unionByName)

> Example: Refine the docs for Union, UnionAll and unionByName
> 
>
> Key: SPARK-44565
> URL: https://issues.apache.org/jira/browse/SPARK-44565
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Created] (SPARK-44565) Refine the docs for Union, UnionAll and unionByName

2023-07-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44565:
-

 Summary: Refine the docs for Union, UnionAll and unionByName
 Key: SPARK-44565
 URL: https://issues.apache.org/jira/browse/SPARK-44565
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng










[jira] [Updated] (SPARK-44564) Refine the documents with LLM

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44564:
--
Description: 
Let's first focus on the Documents of PySpark DataFrame APIs.

1, Choose a subset of DF APIs
Since the review bandwidth is limited, we recommend each PR contains at least 5 
APIs;

2, For each API, copy-paste the function (including function signature, doc 
string) to an LLM model, and ask it to refine the document with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more 
examples
* please provide more examples for function 'unionByName'

It is highly recommended to leverage *GPT-4* instead of GPT-3.5, since the 
former generates better results.

3, The generated doc string may contain some mistakes, e.g.
* The description uses a wrong version, or adds a 'Raises' section for a 
non-existent exception;
* The example code doesn't reflect the example title;
* The example results are incorrect

We need to fix them before sending a PR.

We can generate the docs with different prompts, choose the good parts, and 
combine them into the new doc string.

  was:
Let's first focus on the Documents of PySpark DataFrame APIs.

1, Choose a subset of DF APIs
Since the review bandwidth is limited, we recommend each PR contains at least 5 
APIs;

2, For each API, copy-paste the function (including function signature, doc 
string) to an LLM model, and ask it to refine the document with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more 
examples
* please provide more examples for function 'unionByName'

It is highly recommended to leverage *GPT-4* instead of GPT-3.5, since the 
former generates better results.

3, The generated doc string may contain some mistakes, e.g.
* The description uses a wrong version, or adds a 'Raises' section for a 
non-existent exception;
* The example code doesn't reflect the example title;
* The example results are incorrect

We need to fix them before sending a PR.

We can generate the docs with different prompts, choose the good parts, and 
combine them into the new doc string.


> Refine the documents with LLM
> -
>
> Key: SPARK-44564
> URL: https://issues.apache.org/jira/browse/SPARK-44564
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Let's first focus on the Documents of PySpark DataFrame APIs.
> 1, Choose a subset of DF APIs
> Since the review bandwidth is limited, we recommend each PR contains at least 
> 5 APIs;
> 2, For each API, copy-paste the function (including function signature, doc 
> string) to an LLM model, and ask it to refine the document with prompts like:
> * please improve the docstring of the 'unionByName' function
> * please refine the comments of the 'unionByName' function
> * please refine the documents of the 'unionByName' function, and add more 
> examples
> * please provide more examples for function 'unionByName'
> It is highly recommended to leverage *GPT-4* instead of GPT-3.5, since the 
> former generates better results.
> 3, The generated doc string may contain some mistakes, e.g.
> * The description uses a wrong version, or adds a 'Raises' section for a 
> non-existent exception;
> * The example code doesn't reflect the example title;
> * The example results are incorrect
> We need to fix them before sending a PR.
> We can generate the docs with different prompts, choose the good parts, and 
> combine them into the new doc string.






[jira] [Created] (SPARK-44564) Refine the documents with LLM

2023-07-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44564:
-

 Summary: Refine the documents with LLM
 Key: SPARK-44564
 URL: https://issues.apache.org/jira/browse/SPARK-44564
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng


Let's first focus on the Documents of PySpark DataFrame APIs.

1, Choose a subset of DF APIs
Since the review bandwidth is limited, we recommend each PR contains at least 5 
APIs;

2, For each API, copy-paste the function (including function signature, doc 
string) to an LLM model, and ask it to refine the document with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more 
examples
* please provide more examples for function 'unionByName'

It is highly recommended to leverage *GPT-4* instead of GPT-3.5, since the 
former generates better results.

3, The generated doc string may contain some mistakes, e.g.
* The description uses a wrong version, or adds a 'Raises' section for a 
non-existent exception;
* The example code doesn't reflect the example title;
* The example results are incorrect

We need to fix them before sending a PR.

We can generate the docs with different prompts, choose the good parts, and 
combine them into the new doc string.






[jira] [Resolved] (SPARK-44533) Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze.

2023-07-26 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-44533.
---
  Assignee: Takuya Ueshin
Resolution: Fixed

Issue resolved by pull request 42135
https://github.com/apache/spark/pull/42135

> Add support for accumulator, broadcast, and Spark files in Python UDTF's 
> analyze.
> -
>
> Key: SPARK-44533
> URL: https://issues.apache.org/jira/browse/SPARK-44533
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>







[jira] [Created] (SPARK-44563) Upgrade Apache Arrow to 13.0.0

2023-07-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44563:
---

 Summary: Upgrade Apache Arrow to 13.0.0
 Key: SPARK-44563
 URL: https://issues.apache.org/jira/browse/SPARK-44563
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Resolved] (SPARK-43611) Fix unexpected `AnalysisException` from Spark Connect client

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43611.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42086
[https://github.com/apache/spark/pull/42086]

> Fix unexpected `AnalysisException` from Spark Connect client
> 
>
> Key: SPARK-43611
> URL: https://issues.apache.org/jira/browse/SPARK-43611
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> Reproducible example:
> {code:java}
> >>> import pyspark.pandas as ps
> >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
> >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
> >>> psdf1.append(psdf2)
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py:8897:
>  FutureWarning: The DataFrame.append method is deprecated and will be removed 
> in a future version. Use pyspark.pandas.concat instead.
>   warnings.warn(
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py", 
> line 8930, in append
>     return cast(DataFrame, concat([self, other], ignore_index=ignore_index))
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/namespace.py",
>  line 2703, in concat
>     psdfs[0]._internal.copy(
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py",
>  line 1508, in copy
>     return InternalFrame(
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py",
>  line 753, in __init__
>     schema = spark_frame.select(data_spark_columns).schema
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/dataframe.py",
>  line 1650, in schema
>     return self._session.client.schema(query)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 777, in schema
>     schema = self._analyze(method="schema", plan=plan).schema
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 958, in _analyze
>     self._handle_error(error)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 1195, in _handle_error
>     self._handle_rpc_error(error)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 1231, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: When resolving 'A, fail 
> to find subplan with plan_id=16 in 'Project ['A, 'B]
> +- Project [__index_level_0__#1101L, A#1102L, B#1157L, 
> monotonically_increasing_id() AS __natural_order__#1163L]
>    +- Union false, false
>       :- Project [__index_level_0__#1101L, A#1102L, cast(B#1116 as bigint) AS 
> B#1157L]
>       :  +- Project [__index_level_0__#1101L, A#1102L, B#1116]
>       :     +- Project [__index_level_0__#1101L, A#1102L, 
> __natural_order__#1108L, null AS B#1116]
>       :        +- Project [__index_level_0__#1101L, A#1102L, 
> __natural_order__#1108L]
>       :           +- Project [__index_level_0__#1101L, A#1102L, 
> monotonically_increasing_id() AS __natural_order__#1108L]
>       :              +- Project [__index_level_0__#1097L AS 
> __index_level_0__#1101L, A#1098L AS A#1102L]
>       :                 +- LocalRelation [__index_level_0__#1097L, A#1098L]
>       +- Project [__index_level_0__#1137L, cast(A#1152 as bigint) AS A#1158L, 
> B#1138L]
>          +- Project [__index_level_0__#1137L, A#1152, B#1138L]
>             +- Project [__index_level_0__#1137L, B#1138L, 
> __natural_order__#1144L, null AS A#1152]
>                +- Project [__index_level_0__#1137L, B#1138L, 
> __natural_order__#1144L]
>                   +- Project [__index_level_0__#1137L, B#1138L, 
> monotonically_increasing_id() AS __natural_order__#1144L]
>                      +- Project [__index_level_0__#1133L AS 
> __index_level_0__#1137L, B#1134L AS B#1138L]
>                         +- LocalRelation [__index_level_0__#1133L, B#1134L] 
> {code}
> Another example:
> {code:java}
> >>> pdf = pd.DataFrame(
> ...     {
> ...         "A": [None, 3, None, None],
> ...         "B": [2, 4, None, 3],
> ...         "C": [None, None, None, 1],
> ...         "D": [0, 1, 5, 4],
> ...     },
> ...     columns=["A", "B", "C", "D"],
> ... )
> >>> psdf = ps.from_pandas(pdf)
> >>> psdf.backfill()
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/expressions.py:945:
>  UserWarning: WARN WindowExpression: No Partition Defined for Window 
> operation! Moving all data to a single partition, 

[jira] [Assigned] (SPARK-43611) Fix unexpected `AnalysisException` from Spark Connect client

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43611:
-

Assignee: Ruifeng Zheng

> Fix unexpected `AnalysisException` from Spark Connect client
> 
>
> Key: SPARK-43611
> URL: https://issues.apache.org/jira/browse/SPARK-43611
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>
> Reproducible example:
> {code:java}
> >>> import pyspark.pandas as ps
> >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
> >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
> >>> psdf1.append(psdf2)
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py:8897:
>  FutureWarning: The DataFrame.append method is deprecated and will be removed 
> in a future version. Use pyspark.pandas.concat instead.
>   warnings.warn(
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py", 
> line 8930, in append
>     return cast(DataFrame, concat([self, other], ignore_index=ignore_index))
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/namespace.py",
>  line 2703, in concat
>     psdfs[0]._internal.copy(
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py",
>  line 1508, in copy
>     return InternalFrame(
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py",
>  line 753, in __init__
>     schema = spark_frame.select(data_spark_columns).schema
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/dataframe.py",
>  line 1650, in schema
>     return self._session.client.schema(query)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 777, in schema
>     schema = self._analyze(method="schema", plan=plan).schema
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 958, in _analyze
>     self._handle_error(error)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 1195, in _handle_error
>     self._handle_rpc_error(error)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 1231, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: When resolving 'A, fail 
> to find subplan with plan_id=16 in 'Project ['A, 'B]
> +- Project [__index_level_0__#1101L, A#1102L, B#1157L, 
> monotonically_increasing_id() AS __natural_order__#1163L]
>    +- Union false, false
>       :- Project [__index_level_0__#1101L, A#1102L, cast(B#1116 as bigint) AS 
> B#1157L]
>       :  +- Project [__index_level_0__#1101L, A#1102L, B#1116]
>       :     +- Project [__index_level_0__#1101L, A#1102L, 
> __natural_order__#1108L, null AS B#1116]
>       :        +- Project [__index_level_0__#1101L, A#1102L, 
> __natural_order__#1108L]
>       :           +- Project [__index_level_0__#1101L, A#1102L, 
> monotonically_increasing_id() AS __natural_order__#1108L]
>       :              +- Project [__index_level_0__#1097L AS 
> __index_level_0__#1101L, A#1098L AS A#1102L]
>       :                 +- LocalRelation [__index_level_0__#1097L, A#1098L]
>       +- Project [__index_level_0__#1137L, cast(A#1152 as bigint) AS A#1158L, 
> B#1138L]
>          +- Project [__index_level_0__#1137L, A#1152, B#1138L]
>             +- Project [__index_level_0__#1137L, B#1138L, 
> __natural_order__#1144L, null AS A#1152]
>                +- Project [__index_level_0__#1137L, B#1138L, 
> __natural_order__#1144L]
>                   +- Project [__index_level_0__#1137L, B#1138L, 
> monotonically_increasing_id() AS __natural_order__#1144L]
>                      +- Project [__index_level_0__#1133L AS 
> __index_level_0__#1137L, B#1134L AS B#1138L]
>                         +- LocalRelation [__index_level_0__#1133L, B#1134L] 
> {code}
> Another example:
> {code:java}
> >>> pdf = pd.DataFrame(
> ...     {
> ...         "A": [None, 3, None, None],
> ...         "B": [2, 4, None, 3],
> ...         "C": [None, None, None, 1],
> ...         "D": [0, 1, 5, 4],
> ...     },
> ...     columns=["A", "B", "C", "D"],
> ... )
> >>> psdf = ps.from_pandas(pdf)
> >>> psdf.backfill()
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/expressions.py:945:
>  UserWarning: WARN WindowExpression: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>   

[jira] [Created] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery

2023-07-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-44562:
---

 Summary: Add OptimizeOneRowRelationSubquery in batch of Subquery
 Key: SPARK-44562
 URL: https://issues.apache.org/jira/browse/SPARK-44562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang









[jira] [Resolved] (SPARK-44479) Support Python UDTFs with empty schema

2023-07-26 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-44479.
---
  Assignee: Takuya Ueshin
Resolution: Fixed

Issue resolved by pull request 42161
https://github.com/apache/spark/pull/42161

> Support Python UDTFs with empty schema
> --
>
> Key: SPARK-44479
> URL: https://issues.apache.org/jira/browse/SPARK-44479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Support UDTFs with empty schema, for example:
> {code:python}
> >>> class TestUDTF:
> ...   def eval(self):
> ...     yield tuple()
> {code}
> Currently it fails with `useArrow=True`:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType())().collect()
> Traceback (most recent call last):
> ...
> ValueError: not enough values to unpack (expected 2, got 0)
> {code}
> whereas without Arrow:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
> [Row()]
> {code}
> Otherwise, we should raise an error without Arrow, too, to be consistent.






[jira] [Resolved] (SPARK-44553) Ignoring `connect-check-protos` logic in GA testing

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44553.
---
Fix Version/s: 3.4.2
   Resolution: Fixed

Issue resolved by pull request 42166
[https://github.com/apache/spark/pull/42166]

> Ignoring `connect-check-protos` logic in GA testing
> ---
>
> Key: SPARK-44553
> URL: https://issues.apache.org/jira/browse/SPARK-44553
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.2
>
>







[jira] [Assigned] (SPARK-44553) Ignoring `connect-check-protos` logic in GA testing

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44553:
-

Assignee: BingKun Pan

> Ignoring `connect-check-protos` logic in GA testing
> ---
>
> Key: SPARK-44553
> URL: https://issues.apache.org/jira/browse/SPARK-44553
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Updated] (SPARK-44544) Deduplicate run_python_packaging_tests

2023-07-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44544:
-
Fix Version/s: 3.4.2

> Deduplicate run_python_packaging_tests
> --
>
> Key: SPARK-44544
> URL: https://issues.apache.org/jira/browse/SPARK-44544
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>







[jira] [Updated] (SPARK-44457) Make ArrowEncoderSuite pass Java 17 daily test

2023-07-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-44457:
-
Priority: Minor  (was: Major)

> Make ArrowEncoderSuite pass  Java 17 daily test 
> 
>
> Key: SPARK-44457
> URL: https://issues.apache.org/jira/browse/SPARK-44457
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>







[jira] [Resolved] (SPARK-44457) Make ArrowEncoderSuite pass Java 17 daily test

2023-07-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-44457.
--
Fix Version/s: 3.5.0
   4.0.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/42039

> Make ArrowEncoderSuite pass  Java 17 daily test 
> 
>
> Key: SPARK-44457
> URL: https://issues.apache.org/jira/browse/SPARK-44457
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>







[jira] [Resolved] (SPARK-44522) Upgrade scala-xml to 2.2.0

2023-07-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-44522.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42119
[https://github.com/apache/spark/pull/42119]

> Upgrade scala-xml to 2.2.0
> --
>
> Key: SPARK-44522
> URL: https://issues.apache.org/jira/browse/SPARK-44522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 4.0.0
>
>
> https://github.com/scala/scala-xml/releases/tag/v2.2.0






[jira] [Assigned] (SPARK-44522) Upgrade scala-xml to 2.2.0

2023-07-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-44522:


Assignee: Yang Jie

> Upgrade scala-xml to 2.2.0
> --
>
> Key: SPARK-44522
> URL: https://issues.apache.org/jira/browse/SPARK-44522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/scala/scala-xml/releases/tag/v2.2.0






[jira] [Resolved] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it

2023-07-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44528.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42132
[https://github.com/apache/spark/pull/42132]

> Spark Connect DataFrame does not allow to add custom instance attributes and 
> check for it
> -
>
> Key: SPARK-44528
> URL: https://issues.apache.org/jira/browse/SPARK-44528
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> ```
> df = spark.range(10)
> df._test = 10
> assert hasattr(df, "_test")
> assert not hasattr(df, "_test_no")
> ```
> Treats `df._test` like a column






[jira] [Assigned] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it

2023-07-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44528:


Assignee: Martin Grund

> Spark Connect DataFrame does not allow to add custom instance attributes and 
> check for it
> -
>
> Key: SPARK-44528
> URL: https://issues.apache.org/jira/browse/SPARK-44528
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>
> ```
> df = spark.range(10)
> df._test = 10
> assert hasattr(df, "_test")
> assert not hasattr(df, "_test_no")
> ```
> Treats `df._test` like a column






[jira] [Created] (SPARK-44561) Fix AssertionError when converting UDTF output to a complex type

2023-07-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-44561:


 Summary: Fix AssertionError when converting UDTF output to a 
complex type
 Key: SPARK-44561
 URL: https://issues.apache.org/jira/browse/SPARK-44561
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Allison Wang


{code:java}
class TestUDTF:
    def eval(self):
        yield {'a': 1, 'b': 2},

udtf(TestUDTF, returnType="x: map")().show() {code}
This will fail with:

  File "pandas/_libs/lib.pyx", line 2834, in pandas._libs.lib.map_infer
  File "python/pyspark/sql/pandas/types.py", line 804, in convert_map
    assert isinstance(value, dict)
AssertionError

Same for `convert_struct`






[jira] [Created] (SPARK-44560) Improve tests and documentation for Arrow Python UDF

2023-07-26 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-44560:


 Summary: Improve tests and documentation for Arrow Python UDF
 Key: SPARK-44560
 URL: https://issues.apache.org/jira/browse/SPARK-44560
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0, 4.0.0
Reporter: Xinrong Meng


Test on complex return type

Remove complex return type constraints for Arrow Python UDF on Spark Connect

Update documentation of the related Spark conf






[jira] [Commented] (SPARK-37562) Add Spark History Server Links for Kubernetes & other CMs

2023-07-26 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747676#comment-17747676
 ] 

Holden Karau commented on SPARK-37562:
--

So (in theory) the cluster administrator has some base config, and they set it 
up. They also configure a history server location. When we run on YARN, they 
can configure that location so that the log URL will be printed with the 
correct location (e.g. [historyserver]/[app]) for someone investigating after 
the fact.

 

This just proposes to generalize the YARN config.
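Concretely, the generalized setting might look like this in `spark-defaults.conf`; the non-YARN property names below are the proposal under discussion, not existing Spark settings, and the address is a placeholder:

```properties
# Existing, YARN-specific key:
spark.yarn.historyServer.address   http://history.example.com:18080

# Proposed generalized key, falling back to the YARN key when unset:
spark.historyServer.address        http://history.example.com:18080
```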

> Add Spark History Server Links for Kubernetes & other CMs
> -
>
> Key: SPARK-37562
> URL: https://issues.apache.org/jira/browse/SPARK-37562
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> In YARN we have the Spark history server configured with 
> `spark.yarn.historyServer.address` which allows us to print out useful links 
> on startup for eventual debugging. More than just YARN can have the history 
> server. We should either add `spark.kubernetes.historyServer.address` or move 
> it to `spark.historyServer.address` w/a fall back to the old YARN specific 
> config.
>  






[jira] [Created] (SPARK-44559) Improve error messages for invalid Python UDTF arrow type casts

2023-07-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-44559:


 Summary: Improve error messages for invalid Python UDTF arrow type 
casts
 Key: SPARK-44559
 URL: https://issues.apache.org/jira/browse/SPARK-44559
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Allison Wang


Currently, if a Python UDTF outputs a type that is incompatible with the 
specified output schema, Spark will throw the following confusing error message:
{code:java}
  File "pyarrow/array.pxi", line 1044, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert [1, 2] with type list: tried to 
convert to int32{code}
We should improve this.
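One way to surface a clearer message (a hypothetical helper, not Spark's actual implementation) is to validate each output value against the declared column type before the rows ever reach Arrow, and report the column name together with both types:

```python
def check_output_row(row, schema):
    """Compare each value's Python type against the declared column type.

    `schema` is a list of (column_name, expected_type) pairs -- an
    illustrative stand-in for the real UDTF output schema.
    """
    for (name, expected), value in zip(schema, row):
        if not isinstance(value, expected):
            raise TypeError(
                f"UDTF output column '{name}' expected "
                f"{expected.__name__} but got {type(value).__name__}: {value!r}"
            )
```

Instead of pyarrow's low-level "Could not convert [1, 2] with type list: tried to convert to int32", the user would see which column mismatched and what type was actually produced.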



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44558) Export Pyspark's Spark Connect Log Level

2023-07-26 Thread Alice Sayutina (Jira)
Alice Sayutina created SPARK-44558:
--

 Summary: Export Pyspark's Spark Connect Log Level
 Key: SPARK-44558
 URL: https://issues.apache.org/jira/browse/SPARK-44558
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.1
Reporter: Alice Sayutina


Export the Spark Connect log level as an API function.
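One possible shape for such an API (purely illustrative -- the ticket does not specify the function names or the logger name, which is an assumption here): expose the client logger's level through a small getter/setter pair instead of having users reach into `logging` internals.

```python
import logging

# Assumed logger name; the actual Spark Connect client logger may differ.
_CONNECT_LOGGER_NAME = "pyspark.sql.connect"


def getLogLevel() -> int:
    """Return the effective log level of the Spark Connect client logger."""
    return logging.getLogger(_CONNECT_LOGGER_NAME).getEffectiveLevel()


def setLogLevel(level) -> None:
    """Set the log level; accepts an int or a name like 'DEBUG'."""
    logging.getLogger(_CONNECT_LOGGER_NAME).setLevel(level)
```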



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44264) DeepSpeed Distrobutor

2023-07-26 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747612#comment-17747612
 ] 

Ignite TC Bot commented on SPARK-44264:
---

User 'mathewjacob1002' has created a pull request for this issue:
https://github.com/apache/spark/pull/42118

> DeepSpeed Distrobutor
> -
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.4.1
>Reporter: Lu Wang
>Priority: Critical
> Fix For: 3.5.0
>
> Attachments: Trying to Run Deepspeed Funcs.html
>
>
> Make it easier for PySpark users to run distributed training and inference 
> with DeepSpeed on Spark clusters. This was a project determined by the 
> Databricks ML Training Team.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-44503) Support PARTITION BY and ORDER BY clause for table arguments

2023-07-26 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel reopened SPARK-44503:


Reopening since I added the SQL grammar support only in 
[https://github.com/apache/spark/pull/42100], and next I will add the planning 
and execution parts.

> Support PARTITION BY and ORDER BY clause for table arguments
> 
>
> Key: SPARK-44503
> URL: https://issues.apache.org/jira/browse/SPARK-44503
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44537) Upgrade kubernetes-client to 6.8.0

2023-07-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44537.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42142
[https://github.com/apache/spark/pull/42142]

>  Upgrade kubernetes-client to 6.8.0
> ---
>
> Key: SPARK-44537
> URL: https://issues.apache.org/jira/browse/SPARK-44537
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44537) Upgrade kubernetes-client to 6.8.0

2023-07-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44537:
-

Assignee: BingKun Pan

>  Upgrade kubernetes-client to 6.8.0
> ---
>
> Key: SPARK-44537
> URL: https://issues.apache.org/jira/browse/SPARK-44537
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44557) Flaky PIP packaging test

2023-07-26 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747471#comment-17747471
 ] 

Nikita Awasthi commented on SPARK-44557:


User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42159

> Flaky PIP packaging test
> 
>
> Key: SPARK-44557
> URL: https://issues.apache.org/jira/browse/SPARK-44557
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44524) Balancing pyspark-pandas-connect and pyspark-pandas-slow-connect GA testing time

2023-07-26 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44524:

Summary:  Balancing pyspark-pandas-connect and pyspark-pandas-slow-connect 
GA testing time  (was:  Add a new test group for pyspark-pandas-slow-connect 
module)

>  Balancing pyspark-pandas-connect and pyspark-pandas-slow-connect GA testing 
> time
> -
>
> Key: SPARK-44524
> URL: https://issues.apache.org/jira/browse/SPARK-44524
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44557) Flaky PIP packaging test

2023-07-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44557:


 Summary: Flaky PIP packaging test
 Key: SPARK-44557
 URL: https://issues.apache.org/jira/browse/SPARK-44557
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44531) Move encoder inference to sql/api

2023-07-26 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44531.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move encoder inference to sql/api
> -
>
> Key: SPARK-44531
> URL: https://issues.apache.org/jira/browse/SPARK-44531
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.4.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44555) Use checkError() to check Exception in command Suite & assign some error class names

2023-07-26 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44555:

Summary: Use checkError() to check Exception in command Suite & assign some 
error class names  (was: Use checkError() to check Exception in command Suite & 
Assign new error-class)

> Use checkError() to check Exception in command Suite & assign some error 
> class names
> 
>
> Key: SPARK-44555
> URL: https://issues.apache.org/jira/browse/SPARK-44555
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44555) Use checkError() to check Exception in command Suite & Assign new error-class

2023-07-26 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44555:

Summary: Use checkError() to check Exception in command Suite & Assign new 
error-class  (was: Make branch-3.3 & branch-3.4 daily test happy)

> Use checkError() to check Exception in command Suite & Assign new error-class
> -
>
> Key: SPARK-44555
> URL: https://issues.apache.org/jira/browse/SPARK-44555
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44556) Reuse `OrcTail` when enable vectorizedReader

2023-07-26 Thread dzcxzl (Jira)
dzcxzl created SPARK-44556:
--

 Summary: Reuse `OrcTail` when enable vectorizedReader
 Key: SPARK-44556
 URL: https://issues.apache.org/jira/browse/SPARK-44556
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: dzcxzl






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44098) Introduce python breaking change detection

2023-07-26 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747396#comment-17747396
 ] 

GridGain Integration commented on SPARK-44098:
--

User 'StardustDL' has created a pull request for this issue:
https://github.com/apache/spark/pull/42125

> Introduce python breaking change detection
> --
>
> Key: SPARK-44098
> URL: https://issues.apache.org/jira/browse/SPARK-44098
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> We have breaking change detections for Binary Compatibility and Protobufs, 
> but we don't have one for python.
> Authors of [aexpy|https://github.com/StardustDL/aexpy] are willing to help 
> PySpark detecting python breaking changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44525) Improve error message when Invoke method is not found

2023-07-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44525:


Assignee: Cheng Pan

> Improve error message when Invoke method is not found
> -
>
> Key: SPARK-44525
> URL: https://issues.apache.org/jira/browse/SPARK-44525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44525) Improve error message when Invoke method is not found

2023-07-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44525.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42128
[https://github.com/apache/spark/pull/42128]

> Improve error message when Invoke method is not found
> -
>
> Key: SPARK-44525
> URL: https://issues.apache.org/jira/browse/SPARK-44525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44544) Deduplicate run_python_packaging_tests

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44544.
---
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42146
[https://github.com/apache/spark/pull/42146]

> Deduplicate run_python_packaging_tests
> --
>
> Key: SPARK-44544
> URL: https://issues.apache.org/jira/browse/SPARK-44544
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44544) Deduplicate run_python_packaging_tests

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44544:
-

Assignee: Ruifeng Zheng

> Deduplicate run_python_packaging_tests
> --
>
> Key: SPARK-44544
> URL: https://issues.apache.org/jira/browse/SPARK-44544
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44544) Deduplicate run_python_packaging_tests

2023-07-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44544:
--
Summary: Deduplicate run_python_packaging_tests  (was: Move python 
packaging tests to a separate module)

> Deduplicate run_python_packaging_tests
> --
>
> Key: SPARK-44544
> URL: https://issues.apache.org/jira/browse/SPARK-44544
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions

2023-07-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44554:
-
Description: 
Fix daily test python lint check failure for branches 3.3 and 3.4

 

3.4 : [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266]

3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052

> Install different Python linter dependencies for daily testing of different 
> Spark versions
> --
>
> Key: SPARK-44554
> URL: https://issues.apache.org/jira/browse/SPARK-44554
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>
> Fix daily test python lint check failure for branches 3.3 and 3.4
>  
> 3.4 : 
> [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266]
> 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44555) Make branch-3.3 & branch-3.4 daily test happy

2023-07-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44555:
---

 Summary: Make branch-3.3 & branch-3.4 daily test happy
 Key: SPARK-44555
 URL: https://issues.apache.org/jira/browse/SPARK-44555
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0, 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35914) Driver can't distribute task to executor because NullPointerException

2023-07-26 Thread surya (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747294#comment-17747294
 ] 

surya commented on SPARK-35914:
---

Hey,

We are facing a similar issue; we are using Spark 3.1.1 with Hadoop 3.2.
Is this issue resolved in later versions? If yes, could you let me know
the fixed version? Or if there is any workaround for this issue, please let me know.

> Driver can't distribute task to executor because NullPointerException
> -
>
> Key: SPARK-35914
> URL: https://issues.apache.org/jira/browse/SPARK-35914
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1, 3.1.2
> Environment: hadoop 2.6.0-cdh5.7.1
> Spark 3.0.1, 3.1.1, 3.1.2
>Reporter: Helt Long
>Priority: Major
> Attachments: stuck log.png, webui stuck.png
>
>
> When use spark3 submit a spark job to yarn cluster, I get a problem. Once in 
> a while, driver can't distribute any tasks to any executors, and the stage 
> will stuck , the total spark job will stuck. Check driver log, I found 
> NullPointerException. It's like a netty problem, I can confirm this problem 
> only exist in spark3, because I use spark2 never happend.
>  
> {code:java}
> // Error message
> 21/06/28 14:42:43 INFO TaskSetManager: Starting task 2592.0 in stage 1.0 (TID 
> 3494) (worker39.hadoop, executor 84, partition 2592, RACK_LOCAL, 5006 bytes) 
> taskResourceAssignments Map()
> 21/06/28 14:42:43 INFO TaskSetManager: Finished task 4155.0 in stage 1.0 (TID 
> 3367) in 36670 ms on worker39.hadoop (executor 84) (3278/4249)
> 21/06/28 14:42:43 INFO TaskSetManager: Finished task 2283.0 in stage 1.0 (TID 
> 3422) in 22371 ms on worker15.hadoop (executor 109) (3279/4249)
> 21/06/28 14:42:43 ERROR Inbox: Ignoring error
> java.lang.NullPointerException
>   at java.lang.String.length(String.java:623)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:420)
>   at java.lang.StringBuilder.append(StringBuilder.java:136)
>   at 
> org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$5(TaskSetManager.scala:483)
>   at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
>   at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSetManager.logInfo(TaskSetManager.scala:54)
>   at 
> org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:484)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:444)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:397)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:392)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:392)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:383)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:581)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:576)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:576)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:547)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:547)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:340)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:904)
>   at 
> 

[jira] [Created] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions

2023-07-26 Thread Yang Jie (Jira)
Yang Jie created SPARK-44554:


 Summary: Install different Python linter dependencies for daily 
testing of different Spark versions
 Key: SPARK-44554
 URL: https://issues.apache.org/jira/browse/SPARK-44554
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44553) Ignoring `connect-check-protos` logic in GA testing

2023-07-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44553:
---

 Summary: Ignoring `connect-check-protos` logic in GA testing
 Key: SPARK-44553
 URL: https://issues.apache.org/jira/browse/SPARK-44553
 Project: Spark
  Issue Type: Test
  Components: Build
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org