(spark) branch master updated: [SPARK-47560][PYTHON][CONNECT] Avoid RPC to validate column name with cached schema

ruifengz Tue, 26 Mar 2024 01:01:51 -0700

This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new becbf8b94213 [SPARK-47560][PYTHON][CONNECT] Avoid RPC to validate column name with cached schema
becbf8b94213 is described below

commit becbf8b942132b82e7b906c63ea6077649329b93
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Tue Mar 26 16:01:26 2024 +0800

    [SPARK-47560][PYTHON][CONNECT] Avoid RPC to validate column name with cached schema
    
    ### What changes were proposed in this pull request?
    
    If the column name exists in the cached schema, skip the `df.select` validation.
    
    ### Why are the changes needed?
    
    https://github.com/apache/spark/commit/6f87fe2f513d1b1a022f0d03b6c81d73d7cfb228
    caches the schema, so if the column name exists in the schema, we do not need
    to validate it with `df.select`, which requires an additional RPC.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    ci
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #45717 from zhengruifeng/py_df_getitem_validate.
    
    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Ruifeng Zheng <ruife...@apache.org>
---
 python/pyspark/sql/connect/dataframe.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index 2a22d02387ae..74a4efbe3a79 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1736,7 +1736,10 @@ class DataFrame:
 
                 # validate the column name
                 if not hasattr(self._session, "is_mock_session"):
-                    self.select(item).isLocal()
+                    # Different from __getattr__, the name here can be quoted like df['`id`'].
+                    # Only validate the name when it is not in the cached schema.
+                    if item not in self.columns:
+                        self.select(item).isLocal()
 
                 return Column(
                     ColumnReference(

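The effect of the change above can be illustrated with a small standalone sketch. It does not use pyspark: `SketchDataFrame`, `rpc_count`, `_fetch_schema` and `_validate_remotely` are hypothetical names that only simulate the round trips, and the backtick stripping is a simplifying assumption, not how Spark resolves quoted names.

```python
from typing import List, Optional


class SketchDataFrame:
    """Hypothetical stand-in for a Spark Connect DataFrame (not the pyspark class)."""

    def __init__(self, server_columns: List[str]) -> None:
        self._server_columns = server_columns  # what the simulated server knows
        self._cached_columns: Optional[List[str]] = None
        self.rpc_count = 0  # number of simulated round trips

    def _fetch_schema(self) -> List[str]:
        # Simulates the RPC that retrieves the schema.
        self.rpc_count += 1
        return list(self._server_columns)

    @property
    def columns(self) -> List[str]:
        # The schema is fetched once and then served from the cache.
        if self._cached_columns is None:
            self._cached_columns = self._fetch_schema()
        return self._cached_columns

    def _validate_remotely(self, name: str) -> None:
        # Simulates the `self.select(item).isLocal()` round trip.
        self.rpc_count += 1
        # Assumption: un-quoting is reduced to stripping backticks.
        if name.strip("`") not in self._server_columns:
            raise ValueError(f"cannot resolve column {name!r}")

    def __getitem__(self, item: str) -> str:
        # Only validate remotely when the name is not in the cached schema.
        if item not in self.columns:
            self._validate_remotely(item)
        return item
```

With `df = SketchDataFrame(["id", "name"])`, the first `df["id"]` costs one simulated RPC for the schema, and `df["name"]` costs none. A quoted name such as ``df["`id`"]`` is not found in `columns`, so it still falls back to remote validation.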

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
