[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table
[ https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936732#comment-16936732 ]

Ido Michael commented on SPARK-29166:
--------------------------------------

No problem [~cltlfcjin], great job!

> Add parameters to limit the number of dynamic partitions for data source table
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-29166
>                 URL: https://issues.apache.org/jira/browse/SPARK-29166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> Dynamic partitioning in Hive tables has restrictions that limit the maximum
> number of partitions a query may create. See
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> This is very useful for preventing mistaken partitioning on columns such as
> an ID, and it protects the NameNode from a flood of partition-creation RPC
> calls. Data source tables need a similar limit.
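(For context on the Hive-side limits the ticket references: the guardrails are ordinary Hive settings, and the ticket asks for an analogous knob for data source tables. A minimal PySpark sketch follows; the `spark.sql.sources.maxDynamicPartitions` name is hypothetical, since no such parameter existed when this was filed.)

{code}
# Existing Hive guardrails, set from Spark SQL (enforced for Hive tables only).
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dynamic-partition-limits")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=1000")         # max per query
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=100")  # max per node

# Hypothetical analogue this ticket asks for, applied to data source tables:
# spark.conf.set("spark.sql.sources.maxDynamicPartitions", 1000)

# With such a limit, a write that accidentally partitions on a high-cardinality
# column (e.g. an ID) would fail fast instead of hammering the NameNode:
df = spark.range(100).withColumn("part", (F.col("id") % 10).cast("string"))
df.write.mode("overwrite").partitionBy("part").saveAsTable("demo_partitioned")
{code}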
[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936731#comment-16936731 ]

Ido Michael commented on SPARK-29197:
--------------------------------------

Great job Burak! And thanks for the quick reply!

> Remove saveModeForDSV2 in DataFrameWriter
> -----------------------------------------
>
>                 Key: SPARK-29197
>                 URL: https://issues.apache.org/jira/browse/SPARK-29197
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Burak Yavuz
>            Priority: Blocker
>
> It is very confusing that the default save mode differs between the internal
> implementations of a data source. The reason we had to have saveModeForDSV2
> was that there was no easy way to check whether a table exists in DataSource
> v2. Now we have catalogs for that, so we should be able to remove the
> divergent save modes.
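(A rough illustration of the ticket's rationale: with catalogs, table existence can be checked up front, so the writer no longer needs a DSv2-specific default mode. Sketch only; the `default.events` database/table names and the `table_exists` helper are made up.)

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-existence-check").getOrCreate()
df = spark.range(5).toDF("id")

def table_exists(spark, db, table):
    # Existence check via the session catalog; with catalog support in
    # DataSource v2, lookups like this remove the need for a separate
    # DSv2 default save mode.
    return any(t.name == table for t in spark.catalog.listTables(db))

mode = "append" if table_exists(spark, "default", "events") else "errorifexists"
df.write.mode(mode).saveAsTable("default.events")
{code}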
[jira] [Commented] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935321#comment-16935321 ]

Ido Michael commented on SPARK-28001:
--------------------------------------

I can take a look. Can you please also post the dataset?

Ido

> Dataframe throws 'socket.timeout: timed out' exception
> ------------------------------------------------------
>
>                 Key: SPARK-28001
>                 URL: https://issues.apache.org/jira/browse/SPARK-28001
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: Processor: Intel Core i7-7700 CPU @ 3.60GHz
> RAM: 16 GB
> OS: Windows 10 Enterprise 64-bit
> Python: 3.7.2
> PySpark: 2.4.3
> Cluster manager: Spark Standalone
>            Reporter: Marius Stanescu
>            Priority: Critical
>
> I load data from Azure Table Storage, create a DataFrame, and perform a
> couple of operations via two user-defined functions, then call show() to
> display the results. If I load a very small batch of items, like 5,
> everything works fine, but if I load a batch greater than 10 items from
> Azure Table Storage then I get the 'socket.timeout: timed out' exception.
> Here is the code:
>
> {code}
> import time
> import json
> import requests
> from requests.auth import HTTPBasicAuth
> from azure.cosmosdb.table.tableservice import TableService
> from azure.cosmosdb.table.models import Entity
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf, struct
> from pyspark.sql.types import BooleanType
>
>
> def main():
>     batch_size = 25
>     azure_table_account_name = '***'
>     azure_table_account_key = '***'
>     azure_table_name = '***'
>     agent_name = '***'  # undefined in the original listing; placeholder added so the app name resolves
>
>     spark = SparkSession \
>         .builder \
>         .appName(agent_name) \
>         .config("spark.sql.crossJoin.enabled", "true") \
>         .getOrCreate()
>
>     table_service = TableService(account_name=azure_table_account_name,
>                                  account_key=azure_table_account_key)
>
>     continuation_token = None
>     while True:
>         messages = table_service.query_entities(
>             azure_table_name,
>             select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp",
>             num_results=batch_size,
>             marker=continuation_token,
>             timeout=60)
>         continuation_token = messages.next_marker
>         messages_list = list(messages)
>
>         if not len(messages_list):
>             time.sleep(5)
>             continue  # was 'pass' in the original, which would fall through to createDataFrame on an empty batch
>
>         messages_df = spark.createDataFrame(messages_list)
>
>         register_records_df = messages_df \
>             .withColumn('Registered', register_record(
>                 'RowKey', 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp'))
>
>         only_registered_records_df = register_records_df \
>             .filter(register_records_df.Registered == True) \
>             .drop(register_records_df.Registered)
>
>         update_message_status_df = only_registered_records_df \
>             .withColumn('TableEntryDeleted', delete_table_entity('RowKey', 'PartitionKey'))
>
>         results_df = update_message_status_df.select(
>             update_message_status_df.RowKey,
>             update_message_status_df.PartitionKey,
>             update_message_status_df.TableEntryDeleted)
>
>         # results_df.explain()
>         results_df.show(n=batch_size, truncate=False)
>
>
> @udf(returnType=BooleanType())
> def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>     # call an API
>     try:
>         url = '{}/data/record/{}'.format('***', rowKey)
>         headers = {'Content-type': 'application/json'}
>         response = requests.post(
>             url,
>             headers=headers,
>             auth=HTTPBasicAuth('***', '***'),
>             data=prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp))
>         return bool(response)
>     except:
>         return False
>
>
> def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>     record_data = {
>         "Title": messageId,
>         "Type": '***',
>         "Source": '***',
>         "Creator": ownerSmtp,
>         "Publisher": '***',
>         "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
>     }
>     return json.dumps(record_data)
>
>
> @udf(returnType=BooleanType())
> def delete_table_entity(row_key, partition_key):
>     azure_table_account_name = '***'
>     azure_table_account_key = '***'
>     azure_table_name = '***'
>     try:
>         table_service = TableService(account_name=azure_table_account_name,
>                                      account_key=azure_table_account_key)
>         table_service.delete_entity(azure_table_name, partition_key, row_key)
>         return True
>     except:
>         return False
>
>
> if __name__ == "__main__":
>     main()
> {code}
[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935320#comment-16935320 ]

Ido Michael commented on SPARK-29197:
--------------------------------------

I can take it. I could find the final class DataFrameWriter in spark-sql, but I couldn't find saveModeForDSV2 in it. Where is it? Do I need to remove all of the save modes?

Ido

> Remove saveModeForDSV2 in DataFrameWriter
> -----------------------------------------
>
>                 Key: SPARK-29197
>                 URL: https://issues.apache.org/jira/browse/SPARK-29197
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Burak Yavuz
>            Priority: Blocker
>
> It is very confusing that the default save mode differs between the internal
> implementations of a data source. The reason we had to have saveModeForDSV2
> was that there was no easy way to check whether a table exists in DataSource
> v2. Now we have catalogs for that, so we should be able to remove the
> divergent save modes.
[jira] [Commented] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API
[ https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935307#comment-16935307 ]

Ido Michael commented on SPARK-29157:
--------------------------------------

It looks like https://issues.apache.org/jira/browse/SPARK-28612 was resolved. What is there left to do here?

Ido

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -------------------------------------------------
>
>                 Key: SPARK-29157
>                 URL: https://issues.apache.org/jira/browse/SPARK-29157
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Ryan Blue
>            Priority: Major
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.
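(For reference, a sketch of what the requested PySpark mirror might look like, based on the Scala DataFrameWriterV2 added by SPARK-28612. This API did not exist in PySpark when this comment was written; method names follow the Scala side, and the catalog/table name is a placeholder.)

{code}
# Hypothetical PySpark surface mirroring Scala's DataFrameWriterV2.
df = spark.range(10).toDF("id")

df.writeTo("catalog.db.events").create()               # CREATE TABLE AS SELECT
df.writeTo("catalog.db.events").append()               # INSERT INTO
df.writeTo("catalog.db.events").overwritePartitions()  # dynamic partition overwrite
{code}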
[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table
[ https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935306#comment-16935306 ]

Ido Michael commented on SPARK-29166:
--------------------------------------

I can take this if no one has started working on it.

Ido

> Add parameters to limit the number of dynamic partitions for data source table
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-29166
>                 URL: https://issues.apache.org/jira/browse/SPARK-29166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> Dynamic partitioning in Hive tables has restrictions that limit the maximum
> number of partitions a query may create. See
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> This is very useful for preventing mistaken partitioning on columns such as
> an ID, and it protects the NameNode from a flood of partition-creation RPC
> calls. Data source tables need a similar limit.
[jira] [Commented] (SPARK-28155) do not leak SaveMode to file source v2
[ https://issues.apache.org/jira/browse/SPARK-28155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895354#comment-16895354 ]

Ido Michael commented on SPARK-28155:
--------------------------------------

Hi,

Why was it reopened? What's left to do?

> do not leak SaveMode to file source v2
> --------------------------------------
>
>                 Key: SPARK-28155
>                 URL: https://issues.apache.org/jira/browse/SPARK-28155
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Blocker
>
> Currently there is a hack in `DataFrameWriter` that passes `SaveMode` to
> file source v2. This should be removed; file source v2 should not accept
> SaveMode.
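(For orientation, an illustration of the user-facing writes involved. The fix itself is internal to the planner; this sketch only shows which calls are affected, and the output path is a placeholder.)

{code}
# The writes whose SaveMode is at issue: the planner should translate these
# modes into v2 operations (append vs. overwrite plans) rather than handing
# SaveMode to the file source itself.
df = spark.range(10).toDF("id")

df.write.mode("append").parquet("/tmp/out")     # should become an append plan
df.write.mode("overwrite").parquet("/tmp/out")  # should become an overwrite plan
{code}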