Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46129:
URL: https://github.com/apache/spark/pull/46129#issuecomment-2068736545

   `classic` sounds like too limited a wording because it has no clear meaning and is not extensible from a long-term perspective.



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-21 Thread via GitHub


HyukjinKwon closed pull request #46129: [SPARK-47909][PYTHON][CONNECT] Parent 
DataFrame class for Spark Connect and Spark Classic
URL: https://github.com/apache/spark/pull/46129



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-21 Thread via GitHub


HyukjinKwon commented on PR #46129:
URL: https://github.com/apache/spark/pull/46129#issuecomment-2068279142

   Merged to master.
   
   I will follow up if there are more comments to address.



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573466682


##
python/pyspark/sql/tests/connect/test_connect_plan.py:
##
@@ -333,6 +333,11 @@ def test_observe(self):
 from pyspark.sql.connect.observation import Observation
 
 class MockDF(DataFrame):

Review Comment:
   This might be a breaking change if somebody inherited `pyspark.sql.DataFrame` before and it has its own `__init__`. However, `__init__` is not really an API, and users shouldn't customize, use, or invoke it directly.

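   For illustration, a hypothetical third-party pattern of the kind this comment refers to (the class name and extra argument are made up; constructing an instance still needs a live JVM-backed `jdf` and session, so this is only a sketch of the pattern, not something to run standalone):

   ```py
   # Hypothetical subclass relying on the internal DataFrame.__init__(jdf, sql_ctx)
   # signature; "AuditedDataFrame" and "audit_tag" are illustrative names only.
   from pyspark.sql import DataFrame

   class AuditedDataFrame(DataFrame):
       def __init__(self, jdf, sql_ctx, audit_tag=None):
           super().__init__(jdf, sql_ctx)  # depends on the internal constructor
           self.audit_tag = audit_tag
   ```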



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on PR #46129:
URL: https://github.com/apache/spark/pull/46129#issuecomment-2067819605

   Should be ready for a look.



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573465793


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1974 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark.resource import ResourceProfile
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import StorageLevel
+from pyspark.traceback_utils import SCCallSiteSync
+from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
+from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2
+from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import (
+StructType,
+Row,
+_parse_datatype_json_string,
+)
+from pyspark.sql.dataframe import (
+DataFrame as ParentDataFrame,
+DataFrameNaFunctions as ParentDataFrameNaFunctions,
+DataFrameStatFunctions as ParentDataFrameStatFunctions,
+)
+from pyspark.sql.utils import get_active_spark_context, toJArray
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
+
+if TYPE_CHECKING:
+from py4j.java_gateway import JavaObject
+from pyspark.core.rdd import RDD
+from pyspark.core.context import SparkContext
+from pyspark._typing import PrimitiveType
+from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame
+from pyspark.sql._typing import (
+ColumnOrName,
+ColumnOrNameOrOrdinal,
+LiteralType,
+OptionalPrimitiveType,
+)
+from pyspark.sql.context import SQLContext
+from pyspark.sql.session import SparkSession
+from pyspark.sql.group import GroupedData
+from pyspark.sql.observation import Observation
+
+
+class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin):
+def __new__(
+cls,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+) -> "DataFrame":
+self = object.__new__(cls)
+self.__init__(jdf, sql_ctx)  # type: ignore[misc]
+return self
+
+def __init__(
+self,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+):
+from pyspark.sql.context import SQLContext
+
+self._sql_ctx: Optional["SQLContext"] = None
+
+if isinstance(sql_ctx, SQLContext):
+assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+assert isinstance(sql_ctx, SQLContext)
+# We should remove this if-else branch in the future release, and rename
+# sql_ctx to session in the constructor. This is an internal code path but
+# was kept with a warning because it's used intensively by third-party libraries.
+warnings.warn("DataFrame constructor is internal. Do not directly use it.")
+self._sql_ctx = sql_ctx
+session = sql_ctx.sparkSession
+else:
+session = sql_ctx
+self._session: "SparkSession" = session
+
+self._sc: "SparkContext" = sql_ctx._sc
+self._jdf: "JavaObject" = jdf
+self.is_cached = False
+# initialized lazily
+self._schema: Optional[StructType] = None
+self._lazy_rdd: Optional["RDD[Row]"] = None
+# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
+# by __repr__ and _repr_html_ while eager evaluation opens.
+self._support_repr_html = False
+
+@property
+def 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573465384


##
python/pyspark/sql/connect/session.py:
##
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":
 
 active.__doc__ = PySparkSession.active.__doc__
 
-def table(self, tableName: str) -> DataFrame:
+def table(self, tableName: str) -> ParentDataFrame:

Review Comment:
   Here is one example of the error:
   
   ```
   python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame"  [override]
   ```




Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573465371


##
python/pyspark/sql/connect/session.py:
##
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":
 
 active.__doc__ = PySparkSession.active.__doc__
 
-def table(self, tableName: str) -> DataFrame:
+def table(self, tableName: str) -> ParentDataFrame:

Review Comment:
   Seems like the override arguments cannot be a more specific type, and the return types can't be wider types (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides). So it complains about the argument.
   
   Let me just keep them all as the parent DataFrame for simplicity because those types aren't user-facing anyway.

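   For reference, a tiny standalone sketch (hypothetical class names, not PySpark code) of the rule mypy enforces here: an override may not narrow its argument types, while narrowing the return type is allowed.

   ```py
   class Parent:
       def exceptAll(self, other: "Parent") -> "Parent":
           return other

   class Child(Parent):
       # mypy reports something along the lines of:
       #   Argument 1 of "exceptAll" is incompatible with supertype "Parent";
       #   supertype defines the argument type as "Parent"  [override]
       def exceptAll(self, other: "Child") -> "Child":  # narrower argument: rejected
           return other                                 # narrower return type: fine
   ```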



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573095371


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1974 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark.resource import ResourceProfile
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import StorageLevel
+from pyspark.traceback_utils import SCCallSiteSync
+from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
+from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2
+from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import (
+StructType,
+Row,
+_parse_datatype_json_string,
+)
+from pyspark.sql.dataframe import (
+DataFrame as ParentDataFrame,
+DataFrameNaFunctions as ParentDataFrameNaFunctions,
+DataFrameStatFunctions as ParentDataFrameStatFunctions,
+)
+from pyspark.sql.utils import get_active_spark_context, toJArray
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
+
+if TYPE_CHECKING:
+from py4j.java_gateway import JavaObject
+from pyspark.core.rdd import RDD
+from pyspark.core.context import SparkContext
+from pyspark._typing import PrimitiveType
+from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame
+from pyspark.sql._typing import (
+ColumnOrName,
+ColumnOrNameOrOrdinal,
+LiteralType,
+OptionalPrimitiveType,
+)
+from pyspark.sql.context import SQLContext
+from pyspark.sql.session import SparkSession
+from pyspark.sql.group import GroupedData
+from pyspark.sql.observation import Observation
+
+
+class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin):
+def __new__(
+cls,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+) -> "DataFrame":
+self = object.__new__(cls)
+self.__init__(jdf, sql_ctx)  # type: ignore[misc]
+return self
+
+def __init__(
+self,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+):
+from pyspark.sql.context import SQLContext
+
+self._sql_ctx: Optional["SQLContext"] = None
+
+if isinstance(sql_ctx, SQLContext):
+assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+assert isinstance(sql_ctx, SQLContext)
+# We should remove this if-else branch in the future release, and rename
+# sql_ctx to session in the constructor. This is an internal code path but
+# was kept with a warning because it's used intensively by third-party libraries.
+warnings.warn("DataFrame constructor is internal. Do not directly use it.")
+self._sql_ctx = sql_ctx
+session = sql_ctx.sparkSession
+else:
+session = sql_ctx
+self._session: "SparkSession" = session
+
+self._sc: "SparkContext" = sql_ctx._sc
+self._jdf: "JavaObject" = jdf
+self.is_cached = False
+# initialized lazily
+self._schema: Optional[StructType] = None
+self._lazy_rdd: Optional["RDD[Row]"] = None
+# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
+# by __repr__ and _repr_html_ while eager evaluation opens.
+self._support_repr_html = False
+
+@property
+def 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573095413


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1974 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark.resource import ResourceProfile
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import StorageLevel
+from pyspark.traceback_utils import SCCallSiteSync
+from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
+from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2
+from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import (
+StructType,
+Row,
+_parse_datatype_json_string,
+)
+from pyspark.sql.dataframe import (
+DataFrame as ParentDataFrame,
+DataFrameNaFunctions as ParentDataFrameNaFunctions,
+DataFrameStatFunctions as ParentDataFrameStatFunctions,
+)
+from pyspark.sql.utils import get_active_spark_context, toJArray
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
+
+if TYPE_CHECKING:
+from py4j.java_gateway import JavaObject
+from pyspark.core.rdd import RDD
+from pyspark.core.context import SparkContext
+from pyspark._typing import PrimitiveType
+from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame
+from pyspark.sql._typing import (
+ColumnOrName,
+ColumnOrNameOrOrdinal,
+LiteralType,
+OptionalPrimitiveType,
+)
+from pyspark.sql.context import SQLContext
+from pyspark.sql.session import SparkSession
+from pyspark.sql.group import GroupedData
+from pyspark.sql.observation import Observation
+
+
+class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin):
+def __new__(
+cls,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+) -> "DataFrame":
+self = object.__new__(cls)
+self.__init__(jdf, sql_ctx)  # type: ignore[misc]
+return self
+
+def __init__(
+self,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+):
+from pyspark.sql.context import SQLContext
+
+self._sql_ctx: Optional["SQLContext"] = None
+
+if isinstance(sql_ctx, SQLContext):
+assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+assert isinstance(sql_ctx, SQLContext)
+# We should remove this if-else branch in the future release, and rename
+# sql_ctx to session in the constructor. This is an internal code path but
+# was kept with a warning because it's used intensively by third-party libraries.
+warnings.warn("DataFrame constructor is internal. Do not directly use it.")
+self._sql_ctx = sql_ctx
+session = sql_ctx.sparkSession
+else:
+session = sql_ctx
+self._session: "SparkSession" = session
+
+self._sc: "SparkContext" = sql_ctx._sc
+self._jdf: "JavaObject" = jdf
+self.is_cached = False
+# initialized lazily
+self._schema: Optional[StructType] = None
+self._lazy_rdd: Optional["RDD[Row]"] = None
+# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
+# by __repr__ and _repr_html_ while eager evaluation opens.
+self._support_repr_html = False
+
+@property
+def 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573094548


##
python/pyspark/sql/utils.py:
##
@@ -302,6 +302,33 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
 return cast(FuncT, wrapped)
 
 
+def dispatch_df_method(f: FuncT) -> FuncT:
+"""
+For the usecases of direct DataFrame.union(df, ...), it checks if self
+is a Connect DataFrame or Classic DataFrame, and dispatches.
+"""
+
+@functools.wraps(f)
+def wrapped(*args: Any, **kwargs: Any) -> Any:
+if is_remote() and "PYSPARK_NO_NAMESPACE_SHARE" not in os.environ:
+from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
+
+if isinstance(args[0], ConnectDataFrame):
+return getattr(ConnectDataFrame, f.__name__)(*args, **kwargs)
+else:
+from pyspark.sql.classic.dataframe import DataFrame as ClassicDataFrame

Review Comment:
   It should be covered by `is_remote`, which is `True` when `is_remote_only` is `True`.

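   A self-contained toy reproduction of the dispatch pattern discussed above (ConnectDF, ClassicDF and ParentDF are stand-ins, not PySpark classes; the real decorator also consults `is_remote()` and the `PYSPARK_NO_NAMESPACE_SHARE` environment variable):

   ```py
   import functools
   from typing import Any, Callable, TypeVar, cast

   FuncT = TypeVar("FuncT", bound=Callable[..., Any])

   class ConnectDF:
       def union(self, other: "ConnectDF") -> str:
           return "connect union"

   class ClassicDF:
       def union(self, other: "ClassicDF") -> str:
           return "classic union"

   def dispatch(f: FuncT) -> FuncT:
       """Route ParentDF.method(df, ...) to the implementation matching type(df)."""
       @functools.wraps(f)
       def wrapped(*args: Any, **kwargs: Any) -> Any:
           impl = ConnectDF if isinstance(args[0], ConnectDF) else ClassicDF
           return getattr(impl, f.__name__)(*args, **kwargs)
       return cast(FuncT, wrapped)

   class ParentDF:
       @dispatch
       def union(self, other: "ParentDF") -> str:
           ...

   print(ParentDF.union(ClassicDF(), ClassicDF()))  # classic union
   print(ParentDF.union(ConnectDF(), ConnectDF()))  # connect union
   ```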



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573094799


##
python/pyspark/sql/connect/session.py:
##
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":
 
 active.__doc__ = PySparkSession.active.__doc__
 
-def table(self, tableName: str) -> DataFrame:
+def table(self, tableName: str) -> ParentDataFrame:

Review Comment:
   This was the way MyPy complained the least, IIRC. Let me take a look again...




Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


ueshin commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1572787315


##
python/pyspark/sql/dataframe.py:
##
@@ -139,51 +123,29 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 created via using the constructor.
 """
 
-def __init__(
-self,
+# HACK ALERT!! this is to reduce the backward compatibility concern, and returns
+# Spark Classic DataFrame by default. This is NOT an API, and NOT supposed to
+# be directly invoked. DO NOT use this constructor.
+_sql_ctx: Optional["SQLContext"]
+_session: "SparkSession"
+_sc: "SparkContext"
+_jdf: "JavaObject"
+is_cached: bool
+_schema: Optional[StructType]
+_lazy_rdd: Optional["RDD[Row]"]
+_support_repr_html: bool
+
+def __new__(
+cls,
 jdf: "JavaObject",
 sql_ctx: Union["SQLContext", "SparkSession"],
-):
-from pyspark.sql.context import SQLContext
-
-self._sql_ctx: Optional["SQLContext"] = None
-
-if isinstance(sql_ctx, SQLContext):
-assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
-assert isinstance(sql_ctx, SQLContext)
-# We should remove this if-else branch in the future release, and rename
-# sql_ctx to session in the constructor. This is an internal code path but
-# was kept with a warning because it's used intensively by third-party libraries.
-warnings.warn("DataFrame constructor is internal. Do not directly use it.")
-self._sql_ctx = sql_ctx
-session = sql_ctx.sparkSession
-else:
-session = sql_ctx
-self._session: "SparkSession" = session
-
-self._sc: "SparkContext" = sql_ctx._sc
-self._jdf: "JavaObject" = jdf
-self.is_cached = False
-# initialized lazily
-self._schema: Optional[StructType] = None
-self._lazy_rdd: Optional["RDD[Row]"] = None
-# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
-# by __repr__ and _repr_html_ while eager evaluation opens.
-self._support_repr_html = False
-
-@property
-def sql_ctx(self) -> "SQLContext":
-from pyspark.sql.context import SQLContext
+) -> "DataFrame":
+from pyspark.sql.classic.dataframe import DataFrame
 
-warnings.warn(
-"DataFrame.sql_ctx is an internal property, and will be removed "
-"in future releases. Use DataFrame.sparkSession instead."
-)
-if self._sql_ctx is None:
-self._sql_ctx = SQLContext._get_or_create(self._sc)
-return self._sql_ctx
+return DataFrame.__new__(DataFrame, jdf, sql_ctx)
 
 @property
+@dispatch_df_method

Review Comment:
   The dispatch for `property` doesn't seem to be working.
   
   ```py
   >>> class A:
   ...   @property
   ...   @dispatch_df_method
   ...   def a(self):
   ...     return 1
   >>>
   >>> a = A()
   >>> A.a(a)
   Traceback (most recent call last):
 File "", line 1, in 
   TypeError: 'property' object is not callable
   ```
   
   I don't think we need this for `property`, as this usage of `property` won't work anyway?

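   For context, a minimal plain-Python illustration (no Spark or dispatch decorator involved) of why the class-level call style cannot reach a property: accessing the attribute on the class returns the descriptor itself, which is not callable.

   ```py
   class A:
       @property
       def a(self):
           return 1

   inst = A()
   print(inst.a)                 # 1 -- normal instance access works
   print(type(A.__dict__["a"]))  # <class 'property'> -- the class holds the descriptor
   try:
       A.a(inst)                 # the DataFrame.method(df, ...) call style
   except TypeError as exc:
       print(exc)                # 'property' object is not callable
   ```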


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1974 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark.resource import ResourceProfile
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on PR #46129:
URL: https://github.com/apache/spark/pull/46129#issuecomment-2067140476

   Will fix up the tests soon.



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1572006465


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -2306,7 +2183,7 @@ def _test() -> None:
 )
 
 (failure_count, test_count) = doctest.testmod(
-pyspark.sql.connect.dataframe,
+pyspark.sql.dataframe,

Review Comment:
   It does inherit the docstrings, but `doctest` cannot pick them up. So I manually switch it to `pyspark.sql.dataframe` to run the doctests.

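   For background, a minimal self-contained illustration (plain `doctest`, no Spark) of why inherited docstrings are not collected: `doctest` only gathers docstrings defined on the object itself, not ones inherited from a base class, which is why the module under test is switched to the one that defines them.

   ```py
   import doctest

   class Parent:
       def double(self, x):
           """
           >>> Parent().double(2)
           4
           """
           return x * 2

   class Child(Parent):
       pass  # inherits double() and its docstring, but defines no doctests itself

   finder = doctest.DocTestFinder()
   print([t.name for t in finder.find(Parent)])  # the Parent.double doctest is found
   print([t.name for t in finder.find(Child)])   # [] -- the inherited one is not
   ```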



Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1571898280


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1952 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import StorageLevel
+from pyspark.traceback_utils import SCCallSiteSync
+from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
+from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2
+from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import (
+StructType,
+Row,
+_parse_datatype_json_string,
+)
+from pyspark.sql.dataframe import (
+DataFrame as ParentDataFrame,
+DataFrameNaFunctions as ParentDataFrameNaFunctions,
+DataFrameStatFunctions as ParentDataFrameStatFunctions,
+)
+from pyspark.sql.utils import get_active_spark_context, toJArray
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
+
+if TYPE_CHECKING:
+from py4j.java_gateway import JavaObject
+from pyspark.core.rdd import RDD
+from pyspark.core.context import SparkContext
+from pyspark._typing import PrimitiveType
+from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame
+from pyspark.sql._typing import (
+ColumnOrName,
+ColumnOrNameOrOrdinal,
+LiteralType,
+OptionalPrimitiveType,
+)
+from pyspark.sql.context import SQLContext
+from pyspark.sql.session import SparkSession
+from pyspark.sql.group import GroupedData
+from pyspark.sql.observation import Observation
+
+
+class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin):
+def __new__(
+cls,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+) -> "DataFrame":
+self = object.__new__(cls)
+self.__init__(jdf, sql_ctx)  # type: ignore[misc]
+return self
+
+def __init__(
+self,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+):
+from pyspark.sql.context import SQLContext
+
+self._sql_ctx: Optional["SQLContext"] = None
+
+if isinstance(sql_ctx, SQLContext):
+assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+assert isinstance(sql_ctx, SQLContext)
+# We should remove this if-else branch in the future release, and rename
+# sql_ctx to session in the constructor. This is an internal code path but
+# was kept with a warning because it's used intensively by third-party libraries.
+warnings.warn("DataFrame constructor is internal. Do not directly use it.")
+self._sql_ctx = sql_ctx
+session = sql_ctx.sparkSession
+else:
+session = sql_ctx
+self._session: "SparkSession" = session
+
+self._sc: "SparkContext" = sql_ctx._sc
+self._jdf: "JavaObject" = jdf
+self.is_cached = False
+# initialized lazily
+self._schema: Optional[StructType] = None
+self._lazy_rdd: Optional["RDD[Row]"] = None
+# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
+# by __repr__ and _repr_html_ while eager evaluation opens.
+self._support_repr_html = False
+
+@property
+def sql_ctx(self) -> "SQLContext":
+from 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-19 Thread via GitHub


zhengruifeng commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1571874148


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1952 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import StorageLevel
+from pyspark.traceback_utils import SCCallSiteSync
+from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
+from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2
+from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import (
+StructType,
+Row,
+_parse_datatype_json_string,
+)
+from pyspark.sql.dataframe import (
+DataFrame as ParentDataFrame,
+DataFrameNaFunctions as ParentDataFrameNaFunctions,
+DataFrameStatFunctions as ParentDataFrameStatFunctions,
+)
+from pyspark.sql.utils import get_active_spark_context, toJArray
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
+
+if TYPE_CHECKING:
+from py4j.java_gateway import JavaObject
+from pyspark.core.rdd import RDD
+from pyspark.core.context import SparkContext
+from pyspark._typing import PrimitiveType
+from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame
+from pyspark.sql._typing import (
+ColumnOrName,
+ColumnOrNameOrOrdinal,
+LiteralType,
+OptionalPrimitiveType,
+)
+from pyspark.sql.context import SQLContext
+from pyspark.sql.session import SparkSession
+from pyspark.sql.group import GroupedData
+from pyspark.sql.observation import Observation
+
+
+class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin):
+def __new__(
+cls,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+) -> "DataFrame":
+self = object.__new__(cls)
+self.__init__(jdf, sql_ctx)  # type: ignore[misc]
+return self
+
+def __init__(
+self,
+jdf: "JavaObject",
+sql_ctx: Union["SQLContext", "SparkSession"],
+):
+from pyspark.sql.context import SQLContext
+
+self._sql_ctx: Optional["SQLContext"] = None
+
+if isinstance(sql_ctx, SQLContext):
+assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+assert isinstance(sql_ctx, SQLContext)
+# We should remove this if-else branch in the future release, and rename
+# sql_ctx to session in the constructor. This is an internal code path but
+# was kept with a warning because it's used intensively by third-party libraries.
+warnings.warn("DataFrame constructor is internal. Do not directly use it.")
+self._sql_ctx = sql_ctx
+session = sql_ctx.sparkSession
+else:
+session = sql_ctx
+self._session: "SparkSession" = session
+
+self._sc: "SparkContext" = sql_ctx._sc
+self._jdf: "JavaObject" = jdf
+self.is_cached = False
+# initialized lazily
+self._schema: Optional[StructType] = None
+self._lazy_rdd: Optional["RDD[Row]"] = None
+# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
+# by __repr__ and _repr_html_ while eager evaluation opens.
+self._support_repr_html = False
+
+@property
+def sql_ctx(self) -> "SQLContext":
+from 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-18 Thread via GitHub


HyukjinKwon commented on PR #46129:
URL: https://github.com/apache/spark/pull/46129#issuecomment-2065807249

   cc @ueshin @zhengruifeng @allisonwang-db @xinrong-meng @itholic @hvanhovell 
@grundprinzip  

