[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table

2019-09-24 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936732#comment-16936732
 ] 

Ido Michael commented on SPARK-29166:
-

No problem [~cltlfcjin], great job!

> Add parameters to limit the number of dynamic partitions for data source table
> --
>
> Key: SPARK-29166
> URL: https://issues.apache.org/jira/browse/SPARK-29166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Dynamic partitioning for Hive tables has restrictions that limit the maximum number 
> of partitions. See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> This is very useful for preventing partitions from being created by mistake on 
> high-cardinality columns such as an ID, and it protects the NameNode from a flood of 
> partition-creation RPC calls. Data source tables need a similar limitation.
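For reference, Hive exposes these limits as configuration properties (per the Hive wiki linked above), and in PySpark they are typically set before a dynamic-partition insert into a Hive table. A minimal sketch follows; the last property, spark.sql.sources.maxDynamicPartitions, is purely illustrative of what this ticket proposes and is not an existing setting.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Existing Hive-side limits, as documented in the Hive wiki linked above.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=1000")         # max total dynamic partitions
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=100")  # max per mapper/reducer node

# Hypothetical analogue for data source tables (name is illustrative only):
# spark.conf.set("spark.sql.sources.maxDynamicPartitions", 1000)
{code}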



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-24 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936731#comment-16936731
 ] 

Ido Michael commented on SPARK-29197:
-

Great job Burak!

And thanks for the quick reply!

> Remove saveModeForDSV2 in DataFrameWriter
> -
>
> Key: SPARK-29197
> URL: https://issues.apache.org/jira/browse/SPARK-29197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> It is confusing that the default save mode differs depending on the internal 
> implementation of a data source. The reason we had to have 
> saveModeForDSV2 was that there was no easy way to check for the existence of a 
> table in DataSource V2. Now we have catalogs for that, so we should 
> be able to remove the different save modes.
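A rough PySpark-level sketch of the idea (the DataFrameWriter change itself is in Scala; this only illustrates how an explicit catalog existence check can replace a DSV2-specific default save mode, with the table name "events" as a placeholder):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

def table_exists(name):
    # listTables() covers the current database; good enough for this sketch
    return any(t.name == name for t in spark.catalog.listTables())

# With catalogs, existence becomes an explicit check instead of being
# encoded in a separate default save mode for DSV2.
if table_exists("events"):
    df.write.mode("append").saveAsTable("events")
else:
    df.write.saveAsTable("events")
{code}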



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935321#comment-16935321
 ] 

Ido Michael commented on SPARK-28001:
-

I can take a look

Can you please also post the dataset?

 

Ido

> Dataframe throws 'socket.timeout: timed out' exception
> --
>
> Key: SPARK-28001
> URL: https://issues.apache.org/jira/browse/SPARK-28001
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz
> RAM: 16 GB
> OS: Windows 10 Enterprise 64-bit
> Python: 3.7.2
> PySpark: 3.4.3
> Cluster manager: Spark Standalone
>Reporter: Marius Stanescu
>Priority: Critical
>
> I load data from Azure Table Storage, create a DataFrame, perform a couple 
> of operations via two user-defined functions, and then call show() to display the 
> results. If I load a very small batch of items, like 5, everything works 
> fine, but if I load a batch of more than 10 items from Azure Table Storage, 
> I get the 'socket.timeout: timed out' exception.
> Here is the code:
>  
> {code}
> import time
> import json
> import requests
> from requests.auth import HTTPBasicAuth
> from azure.cosmosdb.table.tableservice import TableService
> from azure.cosmosdb.table.models import Entity
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf, struct
> from pyspark.sql.types import BooleanType
> 
> 
> def main():
>     batch_size = 25
>     azure_table_account_name = '***'
>     azure_table_account_key = '***'
>     azure_table_name = '***'
> 
>     spark = SparkSession \
>         .builder \
>         .appName(agent_name) \
>         .config("spark.sql.crossJoin.enabled", "true") \
>         .getOrCreate()
> 
>     table_service = TableService(account_name=azure_table_account_name,
>                                  account_key=azure_table_account_key)
> 
>     continuation_token = None
>     while True:
>         messages = table_service.query_entities(
>             azure_table_name,
>             select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp",
>             num_results=batch_size,
>             marker=continuation_token,
>             timeout=60)
>         continuation_token = messages.next_marker
>         messages_list = list(messages)
> 
>         if not len(messages_list):
>             time.sleep(5)
>             pass
> 
>         messages_df = spark.createDataFrame(messages_list)
> 
>         register_records_df = messages_df \
>             .withColumn('Registered', register_record(
>                 'RowKey', 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp'))
> 
>         only_registered_records_df = register_records_df \
>             .filter(register_records_df.Registered == True) \
>             .drop(register_records_df.Registered)
> 
>         update_message_status_df = only_registered_records_df \
>             .withColumn('TableEntryDeleted', delete_table_entity(
>                 'RowKey', 'PartitionKey'))
> 
>         results_df = update_message_status_df.select(
>             update_message_status_df.RowKey,
>             update_message_status_df.PartitionKey,
>             update_message_status_df.TableEntryDeleted)
> 
>         # results_df.explain()
>         results_df.show(n=batch_size, truncate=False)
> 
> 
> @udf(returnType=BooleanType())
> def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>     # call an API
>     try:
>         url = '{}/data/record/{}'.format('***', rowKey)
>         headers = {'Content-type': 'application/json'}
>         response = requests.post(
>             url,
>             headers=headers,
>             auth=HTTPBasicAuth('***', '***'),
>             data=prepare_record_data(rowKey, partitionKey,
>                                      messageId, ownerSmtp, timestamp))
>         return bool(response)
>     except:
>         return False
> 
> 
> def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>     record_data = {
>         "Title": messageId,
>         "Type": '***',
>         "Source": '***',
>         "Creator": ownerSmtp,
>         "Publisher": '***',
>         "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
>     }
>     return json.dumps(record_data)
> 
> 
> @udf(returnType=BooleanType())
> def delete_table_entity(row_key, partition_key):
>     azure_table_account_name = '***'
>     azure_table_account_key = '***'
>     azure_table_name = '***'
>     try:
>         table_service = TableService(account_name=azure_table_account_name,
>                                      account_key=azure_table_account_key)
>         table_service.delete_entity(azure_table_name, partition_key, row_key)
>         return True
>     except:
>         return False
> 
> 
> if __name__ == "__main__":
>     main()
> {code}
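One way to narrow this down (a diagnostic sketch, not a fix): both UDFs make an outbound network call per row, so swapping them for trivial stubs keeps the same plan shape while removing the external I/O. If show() still times out with the stubs, the problem is unlikely to be the external services.

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Stub UDFs with the same signatures as register_record / delete_table_entity,
# but with no HTTP or Azure Table calls, to isolate Spark-side timeouts from
# slow network calls inside the UDFs.
@udf(returnType=BooleanType())
def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
    return True

@udf(returnType=BooleanType())
def delete_table_entity(row_key, partition_key):
    return True
{code}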

[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935320#comment-16935320
 ] 

Ido Michael commented on SPARK-29197:
-

I can take it.

 

I found the final class DataFrameWriter in spark-sql, but I couldn't find 
saveModeForDSV2 in it. Where is it?

 

Do I need to remove all of the save modes?

 

Ido

> Remove saveModeForDSV2 in DataFrameWriter
> -
>
> Key: SPARK-29197
> URL: https://issues.apache.org/jira/browse/SPARK-29197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> It is confusing that the default save mode differs depending on the internal 
> implementation of a data source. The reason we had to have 
> saveModeForDSV2 was that there was no easy way to check for the existence of a 
> table in DataSource V2. Now we have catalogs for that, so we should 
> be able to remove the different save modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935307#comment-16935307
 ] 

Ido Michael commented on SPARK-29157:
-

It looks like https://issues.apache.org/jira/browse/SPARK-28612 was resolved.

 

What is there left to do here?

Ido

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.
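For orientation, SPARK-28612 added DataFrameWriterV2 on the Scala side via Dataset.writeTo; the Python counterpart would presumably mirror that surface. A hedged sketch of what such usage could look like follows; the exact PySpark API is precisely what this ticket is about, so treat the names as provisional.

{code}
from pyspark.sql.functions import col

# Provisional shape only; mirrors the Scala DataFrameWriterV2 from SPARK-28612.
df.writeTo("catalog.db.events").using("parquet").partitionedBy(col("region")).create()
df.writeTo("catalog.db.events").append()
df.writeTo("catalog.db.events").overwritePartitions()
{code}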



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935306#comment-16935306
 ] 

Ido Michael commented on SPARK-29166:
-

I can take this if no one has started working on it yet.

Ido

> Add parameters to limit the number of dynamic partitions for data source table
> --
>
> Key: SPARK-29166
> URL: https://issues.apache.org/jira/browse/SPARK-29166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Dynamic partitioning for Hive tables has restrictions that limit the maximum number 
> of partitions. See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> This is very useful for preventing partitions from being created by mistake on 
> high-cardinality columns such as an ID, and it protects the NameNode from a flood of 
> partition-creation RPC calls. Data source tables need a similar limitation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28155) do not leak SaveMode to file source v2

2019-07-29 Thread Ido Michael (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895354#comment-16895354
 ] 

Ido Michael commented on SPARK-28155:
-

Hi,

Why was it reopened?

What's left to do?

 

> do not leak SaveMode to file source v2
> --
>
> Key: SPARK-28155
> URL: https://issues.apache.org/jira/browse/SPARK-28155
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to 
> file source v2. This should be removed and file source v2 should not accept 
> SaveMode.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org