Re: [PR] [MINOR] Improve SparkPi example [spark]
dongjoon-hyun commented on PR #45664: URL: https://github.com/apache/spark/pull/45664#issuecomment-2067877239 Sorry but I don't think this is a kind of `[MINOR]` category work item and I'm not sure about the contribution of this PR, @EnricoMi . Instead, I believe this example is very important than we think because this is a kind of `HelloWorld`. I'd recommend to keep the example as neat and short as possible in the AS-IS status for our newcomers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47914][SQL] Do not display the splits parameter in Rang [spark]
HyukjinKwon commented on PR #46136: URL: https://github.com/apache/spark/pull/46136#issuecomment-2067849302 Mind checking the test failures? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] test java toUpperCase & toLowerCase [spark]
HyukjinKwon commented on code in PR #46147: URL: https://github.com/apache/spark/pull/46147#discussion_r1573487011 ## sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/GetColumnsOperation.java: ## @@ -171,7 +172,7 @@ public void runInternal() throws HiveSQLException { List primaryKeys = metastoreClient.getPrimaryKeys(new PrimaryKeysRequest(dbName, table.getTableName())); Set pkColNames = new HashSet<>(); for(SQLPrimaryKey key : primaryKeys) { -pkColNames.add(key.getColumn_name().toLowerCase()); +pkColNames.add(key.getColumn_name().toLowerCase(Locale.ROOT)); Review Comment: Same for column names. I don't think we should apply root locale -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] test java toUpperCase & toLowerCase [spark]
HyukjinKwon commented on code in PR #46147: URL: https://github.com/apache/spark/pull/46147#discussion_r1573486776 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -385,13 +386,15 @@ public UTF8String toUpperCase() { } private UTF8String toUpperCaseSlow() { -return fromString(toString().toUpperCase()); +return fromString(toString().toUpperCase(Locale.ROOT)); Review Comment: I think we shouldn't apply locale when we handle actual data. My take is that the upper lower cases should be locale dependent. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47920] add doc for python streaming data source API [spark]
HyukjinKwon commented on code in PR #46139: URL: https://github.com/apache/spark/pull/46139#discussion_r1573469889 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -84,6 +87,46 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p row.append(value) yield tuple(row) +** Implement the Stream Reader** + +.. code-block:: python + +class RangePartition(InputPartition): +def __init__(self, start, end): +self.start = start +self.end = end + +class FakeStreamReader(DataSourceStreamReader): +current = 0 + +def initialOffset(self): +return {"offset": 0} + +def latestOffset(self): +self.current += 2 +return {"offset": self.current} + +def partitions(self, start, end): +return [RangePartition(start["offset"], end["offset"])] + +def commit(self, end): +pass + +def read(self, partition): +start, end = partition.start, partition.end +for i in range(start, end): +yield (i, str(i)) + +This is a dummy streaming data reader that generate 2 rows in every microbatch. The streamReader instance has a integer offset that increase by 2 in every microbatch. + +initialOffset() should return the initial start offset of the reader. Review Comment: ```suggestion ``initialOffset()`` should return the initial start offset of the reader. ``` or ```suggestion :meth:`pyspark.sql.datasource.DataSourceStreamReader.initialOffset` ``` or ```suggestion .. currentmodule:: pyspark.sql.datasource :meth:`DataSourceStreamReader.initialOffset` ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47920] add doc for python streaming data source API [spark]
HyukjinKwon commented on code in PR #46139: URL: https://github.com/apache/spark/pull/46139#discussion_r1573470003 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -84,6 +87,46 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p row.append(value) yield tuple(row) +** Implement the Stream Reader** + +.. code-block:: python + +class RangePartition(InputPartition): +def __init__(self, start, end): +self.start = start +self.end = end + +class FakeStreamReader(DataSourceStreamReader): +current = 0 + +def initialOffset(self): +return {"offset": 0} + +def latestOffset(self): +self.current += 2 +return {"offset": self.current} + +def partitions(self, start, end): +return [RangePartition(start["offset"], end["offset"])] + +def commit(self, end): +pass + +def read(self, partition): +start, end = partition.start, partition.end +for i in range(start, end): +yield (i, str(i)) + +This is a dummy streaming data reader that generate 2 rows in every microbatch. The streamReader instance has a integer offset that increase by 2 in every microbatch. + +initialOffset() should return the initial start offset of the reader. + +latestOffset() return the current latest offset that the next microbatch will read to. + +partitions() plans the partitioning of the current microbatch defined by start and end offset, it needs to return a sequence of Partition object. Review Comment: I would reference `Partition` as a class too as shown above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45709][BUILD] Deploy packages when all packages are built [spark]
github-actions[bot] closed pull request #43561: [SPARK-45709][BUILD] Deploy packages when all packages are built URL: https://github.com/apache/spark/pull/43561 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC [spark]
github-actions[bot] closed pull request #41518: [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC URL: https://github.com/apache/spark/pull/41518 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-46620][PS][CONNECT] Implement `Frame.asfreq` [spark]
github-actions[bot] commented on PR #44621: URL: https://github.com/apache/spark/pull/44621#issuecomment-2067823584 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573466682 ## python/pyspark/sql/tests/connect/test_connect_plan.py: ## @@ -333,6 +333,11 @@ def test_observe(self): from pyspark.sql.connect.observation import Observation class MockDF(DataFrame): Review Comment: This might be a breaking change if somebody inherits `pyspark.sql.DataFrame` before, and it has it's own `__init__`. However, `__init__` is not really an API, and users shouldn't really customize/use/invoke them directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on PR #46129: URL: https://github.com/apache/spark/pull/46129#issuecomment-2067819605 Should be ready for a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573465793 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1974 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark.resource import ResourceProfile +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573465384 ## python/pyspark/sql/connect/session.py: ## @@ -325,7 +325,7 @@ def active(cls) -> "SparkSession": active.__doc__ = PySparkSession.active.__doc__ -def table(self, tableName: str) -> DataFrame: +def table(self, tableName: str) -> ParentDataFrame: Review Comment: Here is one example of the error: ``` python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame" [override] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573465371 ## python/pyspark/sql/connect/session.py: ## @@ -325,7 +325,7 @@ def active(cls) -> "SparkSession": active.__doc__ = PySparkSession.active.__doc__ -def table(self, tableName: str) -> DataFrame: +def table(self, tableName: str) -> ParentDataFrame: Review Comment: Seems like the arguments cannot be more specific type, and return types can't be wider types (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides). So it complains about the argument. Let me just keep them all as parent dataframe for simplicity because those types aren't user-facing anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47323][K8S] Support custom executor log urls [spark]
EnricoMi commented on PR #45464: URL: https://github.com/apache/spark/pull/45464#issuecomment-2067722199 @dongjoon-hyun What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47573][K8S] Support custom driver log url [spark]
EnricoMi commented on PR #45728: URL: https://github.com/apache/spark/pull/45728#issuecomment-2067722179 @dongjoon-hyun What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR] Improve SparkPi example [spark]
EnricoMi commented on PR #45664: URL: https://github.com/apache/spark/pull/45664#issuecomment-2067659018 @dongjoon-hyun https://github.com/apache/spark/pull/45664#issuecomment-2016909814 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org