Re: [PR] [MINOR] Improve SparkPi example [spark]

2024-04-20 Thread via GitHub


dongjoon-hyun commented on PR #45664:
URL: https://github.com/apache/spark/pull/45664#issuecomment-2067877239

   Sorry, but I don't think this is a `[MINOR]`-category work item, and I'm not sure about the contribution of this PR, @EnricoMi. Instead, I believe this example is more important than we think, because it is a kind of `HelloWorld`. I'd recommend keeping the example as neat and short as possible, in its AS-IS status, for our newcomers.





Re: [PR] [SPARK-47914][SQL] Do not display the splits parameter in Rang [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on PR #46136:
URL: https://github.com/apache/spark/pull/46136#issuecomment-2067849302

   Mind checking the test failures?





Re: [PR] [WIP] test java toUpperCase & toLowerCase [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46147:
URL: https://github.com/apache/spark/pull/46147#discussion_r1573487011


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/GetColumnsOperation.java:
##
@@ -171,7 +172,7 @@ public void runInternal() throws HiveSQLException {
   List<SQLPrimaryKey> primaryKeys = metastoreClient.getPrimaryKeys(new PrimaryKeysRequest(dbName, table.getTableName()));
   Set<String> pkColNames = new HashSet<>();
   for(SQLPrimaryKey key : primaryKeys) {
-    pkColNames.add(key.getColumn_name().toLowerCase());
+    pkColNames.add(key.getColumn_name().toLowerCase(Locale.ROOT));

Review Comment:
   Same for column names. I don't think we should apply the root locale.






Re: [PR] [WIP] test java toUpperCase & toLowerCase [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46147:
URL: https://github.com/apache/spark/pull/46147#discussion_r1573486776


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -385,13 +386,15 @@ public UTF8String toUpperCase() {
   }
 
   private UTF8String toUpperCaseSlow() {
-    return fromString(toString().toUpperCase());
+    return fromString(toString().toUpperCase(Locale.ROOT));

Review Comment:
   I think we shouldn't apply the root locale when we handle actual data. My take is that upper- and lower-casing should be locale-dependent.
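
   For readers following the thread, a minimal standalone sketch (not part of the PR) of the behavior under discussion: `toLowerCase()` with no argument uses the JVM's default locale, so results vary by environment, while `Locale.ROOT` pins them:

   ```java
   import java.util.Locale;

   public class LocaleCasingDemo {
       public static void main(String[] args) {
           String s = "TITLE";
           // Locale-insensitive: always prints "title", on every JVM.
           System.out.println(s.toLowerCase(Locale.ROOT));
           // Turkish locale: 'I' lower-cases to dotless 'ı', printing "tıtle",
           // which is why the locale choice changes actual data semantics.
           System.out.println(s.toLowerCase(new Locale("tr", "TR")));
       }
   }
   ```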






Re: [PR] [SPARK-47920] add doc for python streaming data source API [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1573469889


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,6 +87,46 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p
         row.append(value)
         yield tuple(row)
 
+**Implement the Stream Reader**
+
+.. code-block:: python
+
+    class RangePartition(InputPartition):
+        def __init__(self, start, end):
+            self.start = start
+            self.end = end
+
+    class FakeStreamReader(DataSourceStreamReader):
+        current = 0
+
+        def initialOffset(self):
+            return {"offset": 0}
+
+        def latestOffset(self):
+            self.current += 2
+            return {"offset": self.current}
+
+        def partitions(self, start, end):
+            return [RangePartition(start["offset"], end["offset"])]
+
+        def commit(self, end):
+            pass
+
+        def read(self, partition):
+            start, end = partition.start, partition.end
+            for i in range(start, end):
+                yield (i, str(i))
+
+This is a dummy streaming data reader that generates 2 rows in every microbatch. The stream reader instance has an integer offset that increases by 2 in every microbatch.
+
+initialOffset() should return the initial start offset of the reader.

Review Comment:
   ```suggestion
   ``initialOffset()`` should return the initial start offset of the reader.
   ```
   
   or
   
   ```suggestion
   :meth:`pyspark.sql.datasource.DataSourceStreamReader.initialOffset`
   ```
   
   or
   
   ```suggestion
   .. currentmodule:: pyspark.sql.datasource
   
   :meth:`DataSourceStreamReader.initialOffset`
   ```
   






Re: [PR] [SPARK-47920] add doc for python streaming data source API [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1573470003


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,6 +87,46 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p
         row.append(value)
         yield tuple(row)
 
+**Implement the Stream Reader**
+
+.. code-block:: python
+
+    class RangePartition(InputPartition):
+        def __init__(self, start, end):
+            self.start = start
+            self.end = end
+
+    class FakeStreamReader(DataSourceStreamReader):
+        current = 0
+
+        def initialOffset(self):
+            return {"offset": 0}
+
+        def latestOffset(self):
+            self.current += 2
+            return {"offset": self.current}
+
+        def partitions(self, start, end):
+            return [RangePartition(start["offset"], end["offset"])]
+
+        def commit(self, end):
+            pass
+
+        def read(self, partition):
+            start, end = partition.start, partition.end
+            for i in range(start, end):
+                yield (i, str(i))
+
+This is a dummy streaming data reader that generates 2 rows in every microbatch. The stream reader instance has an integer offset that increases by 2 in every microbatch.
+
+initialOffset() should return the initial start offset of the reader.
+
+latestOffset() returns the current latest offset that the next microbatch will read up to.
+
+partitions() plans the partitioning of the current microbatch defined by the start and end offsets; it needs to return a sequence of Partition objects.

Review Comment:
   I would reference `Partition` as a class too as shown above.
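
   For context, a minimal sketch of how a stream reader like this is typically wired up and consumed. The `FakeDataSource` wrapper class and the `"fake"` short name are assumptions for illustration; they are not part of the quoted diff, and `spark` is assumed to be an active `SparkSession` with `FakeStreamReader` in scope:

   ```python
   from pyspark.sql.datasource import DataSource

   class FakeDataSource(DataSource):
       """Hypothetical data source exposing FakeStreamReader under the name 'fake'."""

       @classmethod
       def name(cls):
           return "fake"

       def schema(self):
           return "id INT, value STRING"

       def streamReader(self, schema):
           return FakeStreamReader()

   # Register the Python data source, then consume it as a stream.
   spark.dataSource.register(FakeDataSource)
   query = (
       spark.readStream.format("fake").load()
       .writeStream.format("console")
       .start()
   )
   ```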






Re: [PR] [SPARK-45709][BUILD] Deploy packages when all packages are built [spark]

2024-04-20 Thread via GitHub


github-actions[bot] closed pull request #43561: [SPARK-45709][BUILD] Deploy 
packages when all packages are built
URL: https://github.com/apache/spark/pull/43561





Re: [PR] [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC [spark]

2024-04-20 Thread via GitHub


github-actions[bot] closed pull request #41518: [SPARK-19335][SPARK-38200][SQL] 
Add upserts for writing to JDBC
URL: https://github.com/apache/spark/pull/41518





Re: [PR] [WIP][SPARK-46620][PS][CONNECT] Implement `Frame.asfreq` [spark]

2024-04-20 Thread via GitHub


github-actions[bot] commented on PR #44621:
URL: https://github.com/apache/spark/pull/44621#issuecomment-2067823584

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573466682


##
python/pyspark/sql/tests/connect/test_connect_plan.py:
##
@@ -333,6 +333,11 @@ def test_observe(self):
 from pyspark.sql.connect.observation import Observation
 
 class MockDF(DataFrame):

Review Comment:
   This might be a breaking change if somebody inherited `pyspark.sql.DataFrame` before and it has its own `__init__`. However, `__init__` is not really an API, and users shouldn't really customize/use/invoke it directly.
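
   A hypothetical illustration of the subclassing pattern that could break (the subclass, its constructor, and the attribute are made up for this example):

   ```python
   from pyspark.sql import DataFrame

   class TaggedDataFrame(DataFrame):
       # Relied on the old concrete constructor being invoked directly; after the
       # parent/child split, instances are created by the classic/connect
       # subclasses, so this override is not guaranteed to run the same way.
       def __init__(self, jdf, sql_ctx):
           super().__init__(jdf, sql_ctx)
           self.tag = "third-party"
   ```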






Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on PR #46129:
URL: https://github.com/apache/spark/pull/46129#issuecomment-2067819605

   Should be ready for a look.





Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573465793


##
python/pyspark/sql/classic/dataframe.py:
##
@@ -0,0 +1,1974 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import json
+import sys
+import random
+import warnings
+from collections.abc import Iterable
+from functools import reduce
+from typing import (
+Any,
+Callable,
+Dict,
+Iterator,
+List,
+Optional,
+Sequence,
+Tuple,
+Type,
+Union,
+cast,
+overload,
+TYPE_CHECKING,
+)
+
+from pyspark import _NoValue
+from pyspark.resource import ResourceProfile
+from pyspark._globals import _NoValueType
+from pyspark.errors import (
+PySparkTypeError,
+PySparkValueError,
+PySparkIndexError,
+PySparkAttributeError,
+)
+from pyspark.util import (
+is_remote_only,
+_load_from_socket,
+_local_iterator_from_socket,
+)
+from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer
+from pyspark.storagelevel import StorageLevel
+from pyspark.traceback_utils import SCCallSiteSync
+from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
+from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2
+from pyspark.sql.streaming import DataStreamWriter
+from pyspark.sql.types import (
+StructType,
+Row,
+_parse_datatype_json_string,
+)
+from pyspark.sql.dataframe import (
+DataFrame as ParentDataFrame,
+DataFrameNaFunctions as ParentDataFrameNaFunctions,
+DataFrameStatFunctions as ParentDataFrameStatFunctions,
+)
+from pyspark.sql.utils import get_active_spark_context, toJArray
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
+
+if TYPE_CHECKING:
+from py4j.java_gateway import JavaObject
+from pyspark.core.rdd import RDD
+from pyspark.core.context import SparkContext
+from pyspark._typing import PrimitiveType
+from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame
+from pyspark.sql._typing import (
+ColumnOrName,
+ColumnOrNameOrOrdinal,
+LiteralType,
+OptionalPrimitiveType,
+)
+from pyspark.sql.context import SQLContext
+from pyspark.sql.session import SparkSession
+from pyspark.sql.group import GroupedData
+from pyspark.sql.observation import Observation
+
+
+class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin):
+    def __new__(
+        cls,
+        jdf: "JavaObject",
+        sql_ctx: Union["SQLContext", "SparkSession"],
+    ) -> "DataFrame":
+        self = object.__new__(cls)
+        self.__init__(jdf, sql_ctx)  # type: ignore[misc]
+        return self
+
+    def __init__(
+        self,
+        jdf: "JavaObject",
+        sql_ctx: Union["SQLContext", "SparkSession"],
+    ):
+        from pyspark.sql.context import SQLContext
+
+        self._sql_ctx: Optional["SQLContext"] = None
+
+        if isinstance(sql_ctx, SQLContext):
+            assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+            assert isinstance(sql_ctx, SQLContext)
+            # We should remove this if-else branch in the future release, and rename
+            # sql_ctx to session in the constructor. This is an internal code path but
+            # was kept with a warning because it's used intensively by third-party libraries.
+            warnings.warn("DataFrame constructor is internal. Do not directly use it.")
+            self._sql_ctx = sql_ctx
+            session = sql_ctx.sparkSession
+        else:
+            session = sql_ctx
+        self._session: "SparkSession" = session
+
+        self._sc: "SparkContext" = sql_ctx._sc
+        self._jdf: "JavaObject" = jdf
+        self.is_cached = False
+        # initialized lazily
+        self._schema: Optional[StructType] = None
+        self._lazy_rdd: Optional["RDD[Row]"] = None
+        # Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
+        # by __repr__ and _repr_html_ while eager evaluation opens.
+        self._support_repr_html = False
+
+    @property
+    def 

Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573465384


##
python/pyspark/sql/connect/session.py:
##
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":
 
 active.__doc__ = PySparkSession.active.__doc__
 
-def table(self, tableName: str) -> DataFrame:
+def table(self, tableName: str) -> ParentDataFrame:

Review Comment:
   Here is one example of the error:
   
   ```
   python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame"  [override]
   ```






Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]

2024-04-20 Thread via GitHub


HyukjinKwon commented on code in PR #46129:
URL: https://github.com/apache/spark/pull/46129#discussion_r1573465371


##
python/pyspark/sql/connect/session.py:
##
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":
 
 active.__doc__ = PySparkSession.active.__doc__
 
-def table(self, tableName: str) -> DataFrame:
+def table(self, tableName: str) -> ParentDataFrame:

Review Comment:
   Seems like arguments cannot be given more specific types, and return types can't be widened (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides), so it complains about the argument.
   
   Let me just keep them all typed as the parent DataFrame for simplicity, because those types aren't user-facing anyway.
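
   A minimal sketch of the override rule mypy enforces here (class names are illustrative): method parameters may not be narrowed in a subclass, while return types may be narrowed but not widened:

   ```python
   class Parent:
       def exceptAll(self, other: "Parent") -> "Parent":
           return other

   class Child(Parent):
       # mypy error: Argument 1 of "exceptAll" is incompatible with supertype
       # "Parent"; parameters are contravariant, so narrowing to "Child" is rejected.
       def exceptAll(self, other: "Child") -> "Child":  # narrowed return is fine
           return other
   ```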






Re: [PR] [SPARK-47323][K8S] Support custom executor log urls [spark]

2024-04-20 Thread via GitHub


EnricoMi commented on PR #45464:
URL: https://github.com/apache/spark/pull/45464#issuecomment-2067722199

   @dongjoon-hyun What do you think?





Re: [PR] [SPARK-47573][K8S] Support custom driver log url [spark]

2024-04-20 Thread via GitHub


EnricoMi commented on PR #45728:
URL: https://github.com/apache/spark/pull/45728#issuecomment-2067722179

   @dongjoon-hyun What do you think?





Re: [PR] [MINOR] Improve SparkPi example [spark]

2024-04-20 Thread via GitHub


EnricoMi commented on PR #45664:
URL: https://github.com/apache/spark/pull/45664#issuecomment-2067659018

   @dongjoon-hyun 
https://github.com/apache/spark/pull/45664#issuecomment-2016909814

