[jira] [Updated] (SPARK-44564) Refine the documents with LLM
[ https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-44564:
----------------------------------
Description:

Let's first focus on the documents of *PySpark DataFrame APIs*.

*1.* Choose a subset of DataFrame APIs.
Since review bandwidth is limited, we recommend each PR cover at least 5 APIs.

*2.* For each API, copy-paste the function (including the function signature and docstring) into an LLM, and ask it to refine the document with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more examples
* please provide more examples for function 'unionByName'
* ...

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former generates better results.

*3.* Note that the LLM is not 100% reliable; the generated docstring may contain mistakes, e.g.:
* the example code cannot run
* the example results are incorrect
* the example code doesn't reflect the example title
* the description uses a wrong version, or adds a 'Raises' section for a non-existent exception
* ...

We need to fix these before sending a PR. We can try different prompts, choose the good parts, and combine them into the new docstring.

> Refine the documents with LLM
> -----------------------------
>
>                 Key: SPARK-44564
>                 URL: https://issues.apache.org/jira/browse/SPARK-44564
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Priority: Major

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
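Step 2 of the SPARK-44564 description can be sketched as a small helper that pairs each suggested prompt phrasing with the function's signature and current docstring. The prompt phrasings come from the ticket; `build_prompts` and `demo` are hypothetical names for illustration only, not part of any Spark tooling, and the actual call to a model is deliberately left out.

```python
import inspect

# Prompt phrasings taken from the ticket; everything else here is a
# hypothetical sketch of how the prompts might be assembled.
PROMPT_TEMPLATES = [
    "please improve the docstring of the '{name}' function",
    "please refine the comments of the '{name}' function",
    "please refine the documents of the '{name}' function, and add more examples",
    "please provide more examples for function '{name}'",
]

def build_prompts(func):
    # Attach the signature and current docstring to each prompt, so the
    # model sees the documentation it is being asked to refine.
    header = f"def {func.__name__}{inspect.signature(func)}:"
    doc = func.__doc__ or ""
    return [
        f"{template.format(name=func.__name__)}\n\n{header}\n    \"\"\"{doc}\"\"\""
        for template in PROMPT_TEMPLATES
    ]

def demo(x, y):
    """Return the sum of x and y."""
    return x + y

prompts = build_prompts(demo)
print(len(prompts))  # 4: one candidate prompt per phrasing
```

Sending each prompt separately, then comparing the candidates, matches the ticket's advice to "try different prompts, choose the good parts and combine them".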
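The SPARK-44564 ticket warns that generated example code may not run or may print the wrong results. One way to catch both failure modes before opening a PR is to execute the `>>>` examples with Python's stdlib `doctest`. `union_demo` below is a toy stand-in for an LLM-refined function, not a real PySpark API.

```python
import doctest

def union_demo(left, right):
    """Concatenate two lists; a toy stand-in for an LLM-refined API.

    Examples
    --------
    >>> union_demo([1, 2], [3])
    [1, 2, 3]
    """
    return left + right

# Collect and run every '>>>' example in the docstring; a non-zero
# failure count means the generated examples are broken and must be
# fixed before sending the PR.
runner = doctest.DocTestRunner(verbose=False)
for test in doctest.DocTestFinder().find(union_demo, globs={"union_demo": union_demo}):
    runner.run(test, out=lambda s: None)
print(runner.failures)  # 0 when all examples run and match their output
```

For PySpark itself, the same check runs against a live SparkSession via `python/run-tests` (doctests are part of the PySpark test suite), so broken examples surface in CI as well.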
[jira] [Resolved] (SPARK-44557) Flaky PIP packaging test
[ https://issues.apache.org/jira/browse/SPARK-44557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-44557.
----------------------------------
    Fix Version/s: 3.5.0
                   4.0.0
                   3.4.2
       Resolution: Fixed

Issue resolved by pull request 42159
[https://github.com/apache/spark/pull/42159]

> Flaky PIP packaging test
> ------------------------
>
>                 Key: SPARK-44557
>                 URL: https://issues.apache.org/jira/browse/SPARK-44557
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 4.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.5.0, 4.0.0, 3.4.2
>
> e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397
[jira] [Assigned] (SPARK-44557) Flaky PIP packaging test
[ https://issues.apache.org/jira/browse/SPARK-44557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-44557:
------------------------------------
    Assignee: Hyukjin Kwon

> Flaky PIP packaging test
> ------------------------
>
>                 Key: SPARK-44557
>                 URL: https://issues.apache.org/jira/browse/SPARK-44557
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 4.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397
[jira] [Updated] (SPARK-44565) Example: Refine the docs for Union, UnionAll and unionByName
[ https://issues.apache.org/jira/browse/SPARK-44565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-44565:
----------------------------------
    Summary: Example: Refine the docs for Union, UnionAll and unionByName
        (was: Refine the docs for Union, UnionAll and unionByName)

> Example: Refine the docs for Union, UnionAll and unionByName
> ------------------------------------------------------------
>
>                 Key: SPARK-44565
>                 URL: https://issues.apache.org/jira/browse/SPARK-44565
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
[jira] [Created] (SPARK-44565) Refine the docs for Union, UnionAll and unionByName
Ruifeng Zheng created SPARK-44565:
-------------------------------------

             Summary: Refine the docs for Union, UnionAll and unionByName
                 Key: SPARK-44565
                 URL: https://issues.apache.org/jira/browse/SPARK-44565
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation
    Affects Versions: 4.0.0
            Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-44564) Refine the documents with LLM
Ruifeng Zheng created SPARK-44564: - Summary: Refine the documents with LLM Key: SPARK-44564 URL: https://issues.apache.org/jira/browse/SPARK-44564 Project: Spark Issue Type: Umbrella Components: Documentation Affects Versions: 4.0.0 Reporter: Ruifeng Zheng Let's first focus on the documents of PySpark DataFrame APIs. 1, Choose a subset of DF APIs Since the review bandwidth is limited, we recommend each PR contain at least 5 APIs; 2, For each API, copy-paste the function (including the function signature and docstring) to an LLM model, and ask it to refine the document with prompts like: * please improve the docstring of the 'unionByName' function * please refine the comments of the 'unionByName' function * please refine the documents of the 'unionByName' function, and add more examples * please provide more examples for function 'unionByName' It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former generates better results. 3, The generated docstring may contain some bugs, e.g. * The description uses the wrong version, or adds a 'Raises' section for a non-existent exception; * The example code doesn't reflect the example title; * The example results are incorrect. We need to fix these before sending a PR. We can generate the docs with different prompts, choose the good parts, and combine them into the new docstring. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
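Step 3 of the workflow above — verifying that LLM-generated examples actually run and produce the printed results — can be partly automated with Python's standard `doctest` module. A minimal sketch, using a toy stand-in function (`union_by_name` here is illustrative only, not the real PySpark API):

```python
import doctest

def union_by_name(left, right):
    """Concatenate two lists of (name, value) pairs.

    A toy stand-in for a DataFrame API, used only to demonstrate
    checking LLM-generated docstring examples with doctest.

    Examples
    --------
    >>> union_by_name([("a", 1)], [("b", 2)])
    [('a', 1), ('b', 2)]
    """
    return list(left) + list(right)

# Collect and run every doctest example in the function's docstring;
# any failure means a generated example must be fixed before a PR.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
failed = sum(runner.run(test).failed for test in finder.find(union_by_name))
print(failed)  # 0 when all examples pass
```

Running the examples this way catches the "example code can not run" and "example results are incorrect" classes of LLM mistakes mechanically, leaving only the semantic checks (wrong version, spurious 'Raises' section) for human review.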
[jira] [Resolved] (SPARK-44533) Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze.
[ https://issues.apache.org/jira/browse/SPARK-44533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44533. --- Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42135 https://github.com/apache/spark/pull/42135 > Add support for accumulator, broadcast, and Spark files in Python UDTF's > analyze. > - > > Key: SPARK-44533 > URL: https://issues.apache.org/jira/browse/SPARK-44533 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44563) Upgrade Apache Arrow to 13.0.0
BingKun Pan created SPARK-44563: --- Summary: Upgrade Apache Arrow to 13.0.0 Key: SPARK-44563 URL: https://issues.apache.org/jira/browse/SPARK-44563 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43611) Fix unexpected `AnalysisException` from Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-43611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43611. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42086 [https://github.com/apache/spark/pull/42086] > Fix unexpected `AnalysisException` from Spark Connect client > > > Key: SPARK-43611 > URL: https://issues.apache.org/jira/browse/SPARK-43611 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > Reproducible example: > {code:java} > >>> import pyspark.pandas as ps > >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]}) > >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]}) > >>> psdf1.append(psdf2) > /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py:8897: > FutureWarning: The DataFrame.append method is deprecated and will be removed > in a future version. Use pyspark.pandas.concat instead. > warnings.warn( > Traceback (most recent call last): > File "", line 1, in > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py", > line 8930, in append > return cast(DataFrame, concat([self, other], ignore_index=ignore_index)) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/namespace.py", > line 2703, in concat > psdfs[0]._internal.copy( > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py", > line 1508, in copy > return InternalFrame( > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py", > line 753, in __init__ > schema = spark_frame.select(data_spark_columns).schema > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/dataframe.py", > line 1650, in schema > return self._session.client.schema(query) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 777, in schema > schema = 
self._analyze(method="schema", plan=plan).schema > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 958, in _analyze > self._handle_error(error) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 1195, in _handle_error > self._handle_rpc_error(error) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 1231, in _handle_rpc_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.AnalysisException: When resolving 'A, fail > to find subplan with plan_id=16 in 'Project ['A, 'B] > +- Project [__index_level_0__#1101L, A#1102L, B#1157L, > monotonically_increasing_id() AS __natural_order__#1163L] > +- Union false, false > :- Project [__index_level_0__#1101L, A#1102L, cast(B#1116 as bigint) AS > B#1157L] > : +- Project [__index_level_0__#1101L, A#1102L, B#1116] > : +- Project [__index_level_0__#1101L, A#1102L, > __natural_order__#1108L, null AS B#1116] > : +- Project [__index_level_0__#1101L, A#1102L, > __natural_order__#1108L] > : +- Project [__index_level_0__#1101L, A#1102L, > monotonically_increasing_id() AS __natural_order__#1108L] > : +- Project [__index_level_0__#1097L AS > __index_level_0__#1101L, A#1098L AS A#1102L] > : +- LocalRelation [__index_level_0__#1097L, A#1098L] > +- Project [__index_level_0__#1137L, cast(A#1152 as bigint) AS A#1158L, > B#1138L] > +- Project [__index_level_0__#1137L, A#1152, B#1138L] > +- Project [__index_level_0__#1137L, B#1138L, > __natural_order__#1144L, null AS A#1152] > +- Project [__index_level_0__#1137L, B#1138L, > __natural_order__#1144L] > +- Project [__index_level_0__#1137L, B#1138L, > monotonically_increasing_id() AS __natural_order__#1144L] > +- Project [__index_level_0__#1133L AS > __index_level_0__#1137L, B#1134L AS B#1138L] > +- LocalRelation [__index_level_0__#1133L, B#1134L] > {code} > Another example: > {code:java} > >>> pdf = pd.DataFrame( 
> ... { > ... "A": [None, 3, None, None], > ... "B": [2, 4, None, 3], > ... "C": [None, None, None, 1], > ... "D": [0, 1, 5, 4], > ... }, > ... columns=["A", "B", "C", "D"], > ... ) > >>> psdf = ps.from_pandas(pdf) > >>> psdf.backfill() > /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/expressions.py:945: > UserWarning: WARN WindowExpression: No Partition Defined for Window > operation! Moving all data to a single partition,
[jira] [Assigned] (SPARK-43611) Fix unexpected `AnalysisException` from Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-43611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43611: - Assignee: Ruifeng Zheng > Fix unexpected `AnalysisException` from Spark Connect client > > > Key: SPARK-43611 > URL: https://issues.apache.org/jira/browse/SPARK-43611 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > > Reproducible example: > {code:java} > >>> import pyspark.pandas as ps > >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]}) > >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]}) > >>> psdf1.append(psdf2) > /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py:8897: > FutureWarning: The DataFrame.append method is deprecated and will be removed > in a future version. Use pyspark.pandas.concat instead. > warnings.warn( > Traceback (most recent call last): > File "", line 1, in > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py", > line 8930, in append > return cast(DataFrame, concat([self, other], ignore_index=ignore_index)) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/namespace.py", > line 2703, in concat > psdfs[0]._internal.copy( > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py", > line 1508, in copy > return InternalFrame( > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py", > line 753, in __init__ > schema = spark_frame.select(data_spark_columns).schema > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/dataframe.py", > line 1650, in schema > return self._session.client.schema(query) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 777, in schema > schema = self._analyze(method="schema", plan=plan).schema > File > 
"/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 958, in _analyze > self._handle_error(error) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 1195, in _handle_error > self._handle_rpc_error(error) > File > "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", > line 1231, in _handle_rpc_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.AnalysisException: When resolving 'A, fail > to find subplan with plan_id=16 in 'Project ['A, 'B] > +- Project [__index_level_0__#1101L, A#1102L, B#1157L, > monotonically_increasing_id() AS __natural_order__#1163L] > +- Union false, false > :- Project [__index_level_0__#1101L, A#1102L, cast(B#1116 as bigint) AS > B#1157L] > : +- Project [__index_level_0__#1101L, A#1102L, B#1116] > : +- Project [__index_level_0__#1101L, A#1102L, > __natural_order__#1108L, null AS B#1116] > : +- Project [__index_level_0__#1101L, A#1102L, > __natural_order__#1108L] > : +- Project [__index_level_0__#1101L, A#1102L, > monotonically_increasing_id() AS __natural_order__#1108L] > : +- Project [__index_level_0__#1097L AS > __index_level_0__#1101L, A#1098L AS A#1102L] > : +- LocalRelation [__index_level_0__#1097L, A#1098L] > +- Project [__index_level_0__#1137L, cast(A#1152 as bigint) AS A#1158L, > B#1138L] > +- Project [__index_level_0__#1137L, A#1152, B#1138L] > +- Project [__index_level_0__#1137L, B#1138L, > __natural_order__#1144L, null AS A#1152] > +- Project [__index_level_0__#1137L, B#1138L, > __natural_order__#1144L] > +- Project [__index_level_0__#1137L, B#1138L, > monotonically_increasing_id() AS __natural_order__#1144L] > +- Project [__index_level_0__#1133L AS > __index_level_0__#1137L, B#1134L AS B#1138L] > +- LocalRelation [__index_level_0__#1133L, B#1134L] > {code} > Another example: > {code:java} > >>> pdf = pd.DataFrame( > ... { > ... "A": [None, 3, None, None], > ... 
"B": [2, 4, None, 3], > ... "C": [None, None, None, 1], > ... "D": [0, 1, 5, 4], > ... }, > ... columns=["A", "B", "C", "D"], > ... ) > >>> psdf = ps.from_pandas(pdf) > >>> psdf.backfill() > /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/expressions.py:945: > UserWarning: WARN WindowExpression: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. >
[jira] [Created] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
Yuming Wang created SPARK-44562: --- Summary: Add OptimizeOneRowRelationSubquery in batch of Subquery Key: SPARK-44562 URL: https://issues.apache.org/jira/browse/SPARK-44562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44479) Support Python UDTFs with empty schema
[ https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44479. --- Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42161 https://github.com/apache/spark/pull/42161 > Support Python UDTFs with empty schema > -- > > Key: SPARK-44479 > URL: https://issues.apache.org/jira/browse/SPARK-44479 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > Support UDTFs with empty schema, for example: > {code:python} > >>> class TestUDTF: > ... def eval(self): > ... yield tuple() > {code} > Currently it fails with `useArrow=True`: > {code:python} > >>> udtf(TestUDTF, returnType=StructType())().collect() > Traceback (most recent call last): > ... > ValueError: not enough values to unpack (expected 2, got 0) > {code} > whereas without Arrow: > {code:python} > >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() > [Row()] > {code} > Otherwise, we should raise an error without Arrow, too, to be consistent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
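The `not enough values to unpack (expected 2, got 0)` failure above is plain tuple-unpacking behavior: a code path that assumes every row carries a fixed number of fields fails on an empty row. A minimal pure-Python illustration of the mechanism (not the actual Arrow serializer code):

```python
def split_row(row):
    # Assumes exactly two fields per row, as the failing Arrow path
    # appears to; an empty row then raises the error from the ticket.
    key, value = row
    return key, value

try:
    split_row(())  # a row under an empty schema
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```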
[jira] [Resolved] (SPARK-44553) Ignoring `connect-check-protos` logic in GA testing
[ https://issues.apache.org/jira/browse/SPARK-44553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44553. --- Fix Version/s: 3.4.2 Resolution: Fixed Issue resolved by pull request 42166 [https://github.com/apache/spark/pull/42166] > Ignoring `connect-check-protos` logic in GA testing > --- > > Key: SPARK-44553 > URL: https://issues.apache.org/jira/browse/SPARK-44553 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44553) Ignoring `connect-check-protos` logic in GA testing
[ https://issues.apache.org/jira/browse/SPARK-44553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44553: - Assignee: BingKun Pan > Ignoring `connect-check-protos` logic in GA testing > --- > > Key: SPARK-44553 > URL: https://issues.apache.org/jira/browse/SPARK-44553 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44544) Deduplicate run_python_packaging_tests
[ https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44544: - Fix Version/s: 3.4.2 > Deduplicate run_python_packaging_tests > -- > > Key: SPARK-44544 > URL: https://issues.apache.org/jira/browse/SPARK-44544 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44457) Make ArrowEncoderSuite pass Java 17 daily test
[ https://issues.apache.org/jira/browse/SPARK-44457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-44457: - Priority: Minor (was: Major) > Make ArrowEncoderSuite pass Java 17 daily test > > > Key: SPARK-44457 > URL: https://issues.apache.org/jira/browse/SPARK-44457 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44457) Make ArrowEncoderSuite pass Java 17 daily test
[ https://issues.apache.org/jira/browse/SPARK-44457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-44457. -- Fix Version/s: 3.5.0 4.0.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/42039 > Make ArrowEncoderSuite pass Java 17 daily test > > > Key: SPARK-44457 > URL: https://issues.apache.org/jira/browse/SPARK-44457 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44522) Upgrade scala-xml to 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-44522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-44522. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42119 [https://github.com/apache/spark/pull/42119] > Upgrade scala-xml to 2.2.0 > -- > > Key: SPARK-44522 > URL: https://issues.apache.org/jira/browse/SPARK-44522 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > > https://github.com/scala/scala-xml/releases/tag/v2.2.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44522) Upgrade scala-xml to 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-44522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-44522: Assignee: Yang Jie > Upgrade scala-xml to 2.2.0 > -- > > Key: SPARK-44522 > URL: https://issues.apache.org/jira/browse/SPARK-44522 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/scala/scala-xml/releases/tag/v2.2.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44528. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42132 [https://github.com/apache/spark/pull/42132] > Spark Connect DataFrame does not allow to add custom instance attributes and > check for it > - > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > ``` > df = spark.range(10) > df._test = 10 > assert(hasattr(df, "_test")) > assert(not hasattr(df, "_test_no")) > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44528: Assignee: Martin Grund > Spark Connect DataFrame does not allow to add custom instance attributes and > check for it > - > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > assert(hasattr(df, "_test")) > assert(not hasattr(df, "_test_no")) > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
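The root cause in SPARK-44528 — `df._test` being "treated like a column" — is the classic `__getattr__` fallback pattern: if every unknown attribute resolves to a column reference, `hasattr()` never returns False. A minimal pure-Python sketch of the behavior and the shape of the fix (an illustration, not the actual Spark Connect implementation):

```python
class ColumnFallbackFrame:
    """Toy DataFrame whose attribute lookup falls back to column resolution."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails. Resolving every
        # unknown name as a column would make hasattr() always True; the
        # fix is to raise AttributeError for names that are not columns.
        if name in self._columns:
            return f"Column<{name}>"
        raise AttributeError(name)

df = ColumnFallbackFrame(["id"])
df._test = 10                   # custom instance attribute now sticks
print(hasattr(df, "_test"))     # True: found via normal lookup
print(hasattr(df, "_test_no"))  # False, since __getattr__ raises
```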
[jira] [Created] (SPARK-44561) Fix AssertionError when converting UDTF output to a complex type
Allison Wang created SPARK-44561: Summary: Fix AssertionError when converting UDTF output to a complex type Key: SPARK-44561 URL: https://issues.apache.org/jira/browse/SPARK-44561 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Allison Wang {code:java} class TestUDTF: def eval(self): yield {'a': 1, 'b': 2}, udtf(TestUDTF, returnType="x: map")().show() {code} This will fail with: File "pandas/_libs/lib.pyx", line 2834, in pandas._libs.lib.map_infer File "python/pyspark/sql/pandas/types.py", line 804, in convert_map assert isinstance(value, dict) AssertionError Same for `convert_struct` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
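The bare `assert isinstance(value, dict)` in `convert_map` is what surfaces as the opaque `AssertionError` above. A hedged sketch of the kind of defensive conversion that would fail with an actionable message instead (a hypothetical helper, not the actual `pyspark.sql.pandas.types` code):

```python
def convert_map_value(value):
    # Validate before converting, so a mismatched UDTF output (e.g. a
    # tuple where a dict was expected) produces an error naming the
    # offending value rather than a bare AssertionError.
    if not isinstance(value, dict):
        raise TypeError(
            f"expected a dict for a map-typed column, "
            f"got {type(value).__name__}: {value!r}"
        )
    return dict(value)

print(convert_map_value({"a": 1, "b": 2}))  # {'a': 1, 'b': 2}
```

The same pattern would apply to `convert_struct`, which the ticket notes fails identically.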
[jira] [Created] (SPARK-44560) Improve tests and documentation for Arrow Python UDF
Xinrong Meng created SPARK-44560: Summary: Improve tests and documentation for Arrow Python UDF Key: SPARK-44560 URL: https://issues.apache.org/jira/browse/SPARK-44560 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0, 4.0.0 Reporter: Xinrong Meng Test on complex return type Remove complex return type constraints for Arrow Python UDF on Spark Connect Update documentation of the related Spark conf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37562) Add Spark History Server Links for Kubernetes & other CMs
[ https://issues.apache.org/jira/browse/SPARK-37562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747676#comment-17747676 ] Holden Karau commented on SPARK-37562: -- So (in theory) the cluster administrator has some base config, and they set it up. They also configure a history server location. When we run on YARN they can configure that location, and the log URL will be printed with the correct location (e.g. [historyserver]/[app]) for someone investigating after the fact. This just proposes to generalize the YARN config. > Add Spark History Server Links for Kubernetes & other CMs > - > > Key: SPARK-37562 > URL: https://issues.apache.org/jira/browse/SPARK-37562 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1, 3.3.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > In YARN we have the Spark history server configured with > `spark.yarn.historyServer.address`, which allows us to print out useful links > on startup for eventual debugging. More than just YARN can have the history > server. We should either add `spark.kubernetes.historyServer.address` or move > it to `spark.historyServer.address` with a fallback to the old YARN-specific > config. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
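The proposal above — a generic key with a fallback to the YARN-specific one — amounts to a two-level configuration lookup. A sketch over a plain dict (the generic `spark.historyServer.address` key follows the ticket's proposal and is not an existing Spark configuration entry; only the YARN key is real):

```python
def history_server_address(conf):
    # Prefer the proposed generic key; fall back to the YARN-specific
    # key so existing deployments keep working unchanged.
    return conf.get(
        "spark.historyServer.address",
        conf.get("spark.yarn.historyServer.address"),
    )

# A YARN-era config with only the old key set still resolves:
conf = {"spark.yarn.historyServer.address": "http://shs.example:18080"}
print(history_server_address(conf))  # http://shs.example:18080
```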
[jira] [Created] (SPARK-44559) Improve error messages for invalid Python UDTF arrow type casts
Allison Wang created SPARK-44559: Summary: Improve error messages for invalid Python UDTF arrow type casts Key: SPARK-44559 URL: https://issues.apache.org/jira/browse/SPARK-44559 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Allison Wang Currently, if a Python UDTF outputs a type that is incompatible with the specified output schema, Spark will throw the following confusing error message: {code:java} File "pyarrow/array.pxi", line 1044, in pyarrow.lib.Array.from_pandas File "pyarrow/array.pxi", line 316, in pyarrow.lib.array File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert [1, 2] with type list: tried to convert to int32{code} We should improve this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
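One way to improve such messages is to catch the low-level conversion error and re-raise it with the column name and target type attached. A hedged sketch (a hypothetical helper; the real fix would live in the Arrow serialization path, and the function and parameter names here are illustrative):

```python
def cast_udtf_cell(value, caster, col_name, target_type):
    # Wrap the raw conversion failure with context a UDTF author can act on,
    # instead of surfacing pyarrow's internal "Could not convert ..." message.
    try:
        return caster(value)
    except (TypeError, ValueError) as e:
        raise ValueError(
            f"UDTF output value for column '{col_name}' cannot be cast to "
            f"{target_type}: {value!r}"
        ) from e

print(cast_udtf_cell("7", int, "x", "int32"))  # 7
```

Chaining with `from e` keeps the original pyarrow error in the traceback while the top-level message points at the offending column.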
[jira] [Created] (SPARK-44558) Export Pyspark's Spark Connect Log Level
Alice Sayutina created SPARK-44558: -- Summary: Export Pyspark's Spark Connect Log Level Key: SPARK-44558 URL: https://issues.apache.org/jira/browse/SPARK-44558 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.1 Reporter: Alice Sayutina Export spark connect log level as API function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44264) DeepSpeed Distrobutor
[ https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747612#comment-17747612 ] Ignite TC Bot commented on SPARK-44264: --- User 'mathewjacob1002' has created a pull request for this issue: https://github.com/apache/spark/pull/42118 > DeepSpeed Distrobutor > - > > Key: SPARK-44264 > URL: https://issues.apache.org/jira/browse/SPARK-44264 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.4.1 >Reporter: Lu Wang >Priority: Critical > Fix For: 3.5.0 > > Attachments: Trying to Run Deepspeed Funcs.html > > > To make it easier for Pyspark users to run distributed training and inference > with DeepSpeed on spark clusters using PySpark. This was a project determined > by the Databricks ML Training Team. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-44503) Support PARTITION BY and ORDER BY clause for table arguments
[ https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel reopened SPARK-44503: Reopening since I added the SQL grammar support only in [https://github.com/apache/spark/pull/42100,] and next I will add the planning and execution parts. > Support PARTITION BY and ORDER BY clause for table arguments > > > Key: SPARK-44503 > URL: https://issues.apache.org/jira/browse/SPARK-44503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44537) Upgrade kubernetes-client to 6.8.0
[ https://issues.apache.org/jira/browse/SPARK-44537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44537. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42142 [https://github.com/apache/spark/pull/42142] > Upgrade kubernetes-client to 6.8.0 > --- > > Key: SPARK-44537 > URL: https://issues.apache.org/jira/browse/SPARK-44537 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44537) Upgrade kubernetes-client to 6.8.0
[ https://issues.apache.org/jira/browse/SPARK-44537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44537: - Assignee: BingKun Pan > Upgrade kubernetes-client to 6.8.0 > --- > > Key: SPARK-44537 > URL: https://issues.apache.org/jira/browse/SPARK-44537 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44557) Flaky PIP packaging test
[ https://issues.apache.org/jira/browse/SPARK-44557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747471#comment-17747471 ] Nikita Awasthi commented on SPARK-44557: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/42159 > Flaky PIP packaging test > > > Key: SPARK-44557 > URL: https://issues.apache.org/jira/browse/SPARK-44557 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44524) Balancing pyspark-pandas-connect and pyspark-pandas-slow-connect GA testing time
[ https://issues.apache.org/jira/browse/SPARK-44524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44524: Summary: Balancing pyspark-pandas-connect and pyspark-pandas-slow-connect GA testing time (was: Add a new test group for pyspark-pandas-slow-connect module) > Balancing pyspark-pandas-connect and pyspark-pandas-slow-connect GA testing > time > - > > Key: SPARK-44524 > URL: https://issues.apache.org/jira/browse/SPARK-44524 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-44557) Flaky PIP packaging test
Hyukjin Kwon created SPARK-44557: Summary: Flaky PIP packaging test Key: SPARK-44557 URL: https://issues.apache.org/jira/browse/SPARK-44557 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 4.0.0 Reporter: Hyukjin Kwon e.g., https://github.com/apache/spark/actions/runs/5665869112/job/15351515397
[jira] [Resolved] (SPARK-44531) Move encoder inference to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44531. --- Fix Version/s: 3.5.0 Resolution: Fixed > Move encoder inference to sql/api > - > > Key: SPARK-44531 > URL: https://issues.apache.org/jira/browse/SPARK-44531 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > >
[jira] [Updated] (SPARK-44555) Use checkError() to check Exception in command Suite & assign some error class names
[ https://issues.apache.org/jira/browse/SPARK-44555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44555: Summary: Use checkError() to check Exception in command Suite & assign some error class names (was: Use checkError() to check Exception in command Suite & Assign new error-class) > Use checkError() to check Exception in command Suite & assign some error > class names > > > Key: SPARK-44555 > URL: https://issues.apache.org/jira/browse/SPARK-44555 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Updated] (SPARK-44555) Use checkError() to check Exception in command Suite & Assign new error-class
[ https://issues.apache.org/jira/browse/SPARK-44555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44555: Summary: Use checkError() to check Exception in command Suite & Assign new error-class (was: Make branch-3.3 & branch-3.4 daily test happy) > Use checkError() to check Exception in command Suite & Assign new error-class > - > > Key: SPARK-44555 > URL: https://issues.apache.org/jira/browse/SPARK-44555 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-44556) Reuse `OrcTail` when enable vectorizedReader
dzcxzl created SPARK-44556: -- Summary: Reuse `OrcTail` when enable vectorizedReader Key: SPARK-44556 URL: https://issues.apache.org/jira/browse/SPARK-44556 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.1 Reporter: dzcxzl
[jira] [Commented] (SPARK-44098) Introduce python breaking change detection
[ https://issues.apache.org/jira/browse/SPARK-44098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747396#comment-17747396 ] GridGain Integration commented on SPARK-44098: -- User 'StardustDL' has created a pull request for this issue: https://github.com/apache/spark/pull/42125 > Introduce python breaking change detection > -- > > Key: SPARK-44098 > URL: https://issues.apache.org/jira/browse/SPARK-44098 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > > We have breaking change detection for Binary Compatibility and Protobufs, > but we don't have one for Python. > The authors of [aexpy|https://github.com/StardustDL/aexpy] are willing to help > PySpark detect Python breaking changes.
[jira] [Assigned] (SPARK-44525) Improve error message when Invoke method is not found
[ https://issues.apache.org/jira/browse/SPARK-44525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44525: Assignee: Cheng Pan > Improve error message when Invoke method is not found > - > > Key: SPARK-44525 > URL: https://issues.apache.org/jira/browse/SPARK-44525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major >
[jira] [Resolved] (SPARK-44525) Improve error message when Invoke method is not found
[ https://issues.apache.org/jira/browse/SPARK-44525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44525. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42128 [https://github.com/apache/spark/pull/42128] > Improve error message when Invoke method is not found > - > > Key: SPARK-44525 > URL: https://issues.apache.org/jira/browse/SPARK-44525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0, 4.0.0 > >
[jira] [Resolved] (SPARK-44544) Deduplicate run_python_packaging_tests
[ https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44544. --- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42146 [https://github.com/apache/spark/pull/42146] > Deduplicate run_python_packaging_tests > -- > > Key: SPARK-44544 > URL: https://issues.apache.org/jira/browse/SPARK-44544 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0, 4.0.0 > >
[jira] [Assigned] (SPARK-44544) Deduplicate run_python_packaging_tests
[ https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44544: - Assignee: Ruifeng Zheng > Deduplicate run_python_packaging_tests > -- > > Key: SPARK-44544 > URL: https://issues.apache.org/jira/browse/SPARK-44544 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Updated] (SPARK-44544) Deduplicate run_python_packaging_tests
[ https://issues.apache.org/jira/browse/SPARK-44544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44544: -- Summary: Deduplicate run_python_packaging_tests (was: Move python packaging tests to a separate module) > Deduplicate run_python_packaging_tests > -- > > Key: SPARK-44544 > URL: https://issues.apache.org/jira/browse/SPARK-44544 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Updated] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions
[ https://issues.apache.org/jira/browse/SPARK-44554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44554: - Description: Fix daily test python lint check failure for branches 3.3 and 3.4 3.4 : [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266] 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052 > Install different Python linter dependencies for daily testing of different > Spark versions > -- > > Key: SPARK-44554 > URL: https://issues.apache.org/jira/browse/SPARK-44554 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > Fix daily test python lint check failure for branches 3.3 and 3.4 > > 3.4 : > [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266] > 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052
[jira] [Created] (SPARK-44555) Make branch-3.3 & branch-3.4 daily test happy
BingKun Pan created SPARK-44555: --- Summary: Make branch-3.3 & branch-3.4 daily test happy Key: SPARK-44555 URL: https://issues.apache.org/jira/browse/SPARK-44555 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0, 4.0.0 Reporter: BingKun Pan
[jira] [Commented] (SPARK-35914) Driver can't distribute task to executor because NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-35914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747294#comment-17747294 ] surya commented on SPARK-35914: --- Hey, we are facing a similar issue on Spark 3.1.1 with Hadoop 3.2. Has this been fixed in a later version? If so, please let me know which release carries the fix; otherwise, if there is any workaround, please share it. > Driver can't distribute task to executor because NullPointerException > - > > Key: SPARK-35914 > URL: https://issues.apache.org/jira/browse/SPARK-35914 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1, 3.1.2 > Environment: hadoop 2.6.0-cdh5.7.1 > Spark 3.0.1, 3.1.1, 3.1.2 >Reporter: Helt Long >Priority: Major > Attachments: stuck log.png, webui stuck.png > > > When submitting a Spark 3 job to a YARN cluster, we occasionally hit a problem: the driver cannot distribute any tasks to any executors, so the stage and the whole Spark job get stuck. Checking the driver log, I found a NullPointerException. It looks like a Netty problem. I can confirm this problem only exists in Spark 3, because it never happened with Spark 2.
> > {code:java} > // Error message > 21/06/28 14:42:43 INFO TaskSetManager: Starting task 2592.0 in stage 1.0 (TID > 3494) (worker39.hadoop, executor 84, partition 2592, RACK_LOCAL, 5006 bytes) > taskResourceAssignments Map() > 21/06/28 14:42:43 INFO TaskSetManager: Finished task 4155.0 in stage 1.0 (TID > 3367) in 36670 ms on worker39.hadoop (executor 84) (3278/4249) > 21/06/28 14:42:43 INFO TaskSetManager: Finished task 2283.0 in stage 1.0 (TID > 3422) in 22371 ms on worker15.hadoop (executor 109) (3279/4249) > 21/06/28 14:42:43 ERROR Inbox: Ignoring error > java.lang.NullPointerException > at java.lang.String.length(String.java:623) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:420) > at java.lang.StringBuilder.append(StringBuilder.java:136) > at > org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$5(TaskSetManager.scala:483) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:57) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56) > at > org.apache.spark.scheduler.TaskSetManager.logInfo(TaskSetManager.scala:54) > at > org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:484) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:444) > at > org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:397) > at > org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:392) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:392) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:383) > at > 
org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:581) > at > org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:576) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at > org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:576) > at > org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:547) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:547) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:340) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:904) > at >
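The trace above shows `String.length()` throwing a NullPointerException while `TaskSetManager.resourceOffer` builds a log message, i.e. a null String reached the message builder. As a minimal sketch of that failure shape (the names `describeTask` and `host` are hypothetical, not Spark's actual code):

```java
public class NpeSketch {
    // Hypothetical log-message builder; "host" stands in for whatever
    // value was unexpectedly null inside the scheduler.
    static String describeTask(String host) {
        StringBuilder sb = new StringBuilder();
        sb.append("Starting task on host ");
        // Calling any instance method (length(), trim(), ...) on a null
        // String reference throws NullPointerException, matching the
        // java.lang.String.length frame in the trace above.
        sb.append(host.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        try {
            describeTask(null);
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("caught NullPointerException");
        }
    }
}
```

A null check (or `String.valueOf(host)`, which renders null as the text "null") before building the message would avoid crashing the scheduling loop; whether that is the right fix for SPARK-35914 itself depends on where the null originates.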
[jira] [Created] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions
Yang Jie created SPARK-44554: Summary: Install different Python linter dependencies for daily testing of different Spark versions Key: SPARK-44554 URL: https://issues.apache.org/jira/browse/SPARK-44554 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Created] (SPARK-44553) Ignoring `connect-check-protos` logic in GA testing
BingKun Pan created SPARK-44553: --- Summary: Ignoring `connect-check-protos` logic in GA testing Key: SPARK-44553 URL: https://issues.apache.org/jira/browse/SPARK-44553 Project: Spark Issue Type: Test Components: Build Affects Versions: 3.4.0 Reporter: BingKun Pan