[jira] [Created] (SPARK-28345) PythonUDF predicate should be able to pushdown to join

2019-07-10 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28345:
---

 Summary: PythonUDF predicate should be able to pushdown to join
 Key: SPARK-28345
 URL: https://issues.apache.org/jira/browse/SPARK-28345
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


Currently, a Filter predicate that uses a PythonUDF cannot be pushed down into a join 
condition. Such a predicate should be able to be pushed down into the join condition. 
For PythonUDFs that cannot be evaluated in a join condition, 
{{PullOutPythonUDFInJoinCondition}} will pull them out later.
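
A minimal sketch of the kind of query this affects (hypothetical column names and a plain equi-join, not the actual rule change):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
left = spark.range(10).withColumnRenamed("id", "k")
right = spark.range(10).withColumnRenamed("id", "k")

@udf(BooleanType())
def keep(v):
    # hypothetical predicate implemented in Python
    return v % 2 == 0

# The Python UDF predicate sits in a Filter above the join; the proposal is to
# let it be pushed into the join condition, with PullOutPythonUDFInJoinCondition
# pulling it back out later if it cannot be evaluated there.
joined = left.join(right, "k").filter(keep(left["k"]))
joined.explain(True)
{code}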



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28269) ArrowStreamPandasSerializer get stack

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882639#comment-16882639
 ] 

Hyukjin Kwon edited comment on SPARK-28269 at 7/11/19 4:52 AM:
---

A workaround for me was to call `copy()` on this line:

{code:java}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count)).copy()
{code}



was (Author: hyukjin.kwon):
A workaround for me was to call `copy()` on this line:

{code:java}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count)).copy()
{code}

Another workaround is converting it to a list so that we avoid using the memoryview. But in my 
experience, this is less performant.

{code}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(list(np.random.randint(0, 100, size=(rows, 
cols_count))), columns=range(cols_count))
{code}

> ArrowStreamPandasSerializer get stack
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
> (pandas_df.toPickle()) resulted in a file of size 2.5 GB.
> I have some code that groups this data frame and applies a Pandas UDF:
>  
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
> from pyspark.sql.types import *
> from pyspark.sql import functions as F
> initial_list = range(4500)
> rdd = sc.parallelize(initial_list)
> rdd = rdd.map(lambda x: Row(val=x))
> initial_spark_df = spark.createDataFrame(rdd)
> cols_count = 132
> rows = 1000
> # --- Start Generating the big data frame---
> # Generating the schema
> schema = StructType([StructField(str(i), IntegerType()) for i in 
> range(cols_count)])
> @pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
> def random_pd_df_generator(df):
> import numpy as np
> import pandas as pd
> return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
> columns=range(cols_count))
> full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
> # --- End Generating the big data frame---
> # ---Start the bug reproduction---
> grouped_col = "col_0"
> @pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
> def very_simpl_udf(pdf):
> import pandas as pd
> ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
> return ret_val
> # In order to create a huge dataset, I've set all of the grouped_col value to 
> a single value, then, grouped it into a single dataset.
> # Here is where the program gets stuck
> full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
> assert False, "If we're, means that the issue wasn't reproduced"
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
> line, when reading a batch from the reader):
>  
> {code:java}
> for batch in reader:
>  yield [self.arrow_to_pandas(c) for c in  
> pa.Table.from_batches([batch]).itercolumns()]{code}
>  
> You can just run the first code snippet and it will reproduce the issue.
> Open a Pyspark shell with this configuration:
> {code:java}
> pyspark --conf "spark.python.worker.memory=3G" --conf 
> "spark.executor.memory=20G" --conf 
> "spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf 
> "spark.driver.memory=10G"{code}
>  
> Versions:
>  * pandas - 0.24.2
>  * pyarrow - 0.13.0
>  * Spark - 2.4.2
>  * Python - 2.7.16



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28269) ArrowStreamPandasSerializer get stack

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882639#comment-16882639
 ] 

Hyukjin Kwon edited comment on SPARK-28269 at 7/11/19 4:51 AM:
---

A workaround for me was to call `copy()` on this line:

{code:java}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count)).copy()
{code}

Another workaround is converting it to a list so that we avoid using the memoryview. But in my 
experience, this is less performant.

{code}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(list(np.random.randint(0, 100, size=(rows, 
cols_count))), columns=range(cols_count))
{code}


was (Author: hyukjin.kwon):
A workaround for me was to call `copy()` on this line:

{code:java}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count)).copy()
{code}


> ArrowStreamPandasSerializer get stack
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
> (pandas_df.toPickle()) resulted in a file of size 2.5 GB.
> I have some code that groups this data frame and applies a Pandas UDF:
>  
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
> from pyspark.sql.types import *
> from pyspark.sql import functions as F
> initial_list = range(4500)
> rdd = sc.parallelize(initial_list)
> rdd = rdd.map(lambda x: Row(val=x))
> initial_spark_df = spark.createDataFrame(rdd)
> cols_count = 132
> rows = 1000
> # --- Start Generating the big data frame---
> # Generating the schema
> schema = StructType([StructField(str(i), IntegerType()) for i in 
> range(cols_count)])
> @pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
> def random_pd_df_generator(df):
> import numpy as np
> import pandas as pd
> return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
> columns=range(cols_count))
> full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
> # --- End Generating the big data frame---
> # ---Start the bug reproduction---
> grouped_col = "col_0"
> @pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
> def very_simpl_udf(pdf):
> import pandas as pd
> ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
> return ret_val
> # In order to create a huge dataset, I've set all of the grouped_col value to 
> a single value, then, grouped it into a single dataset.
> # Here is where the program gets stuck
> full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
> assert False, "If we're, means that the issue wasn't reproduced"
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
> line, when reading a batch from the reader):
>  
> {code:java}
> for batch in reader:
>  yield [self.arrow_to_pandas(c) for c in  
> pa.Table.from_batches([batch]).itercolumns()]{code}
>  
> You can just run the first code snippet and it will reproduce the issue.
> Open a Pyspark shell with this configuration:
> {code:java}
> pyspark --conf "spark.python.worker.memory=3G" --conf 
> "spark.executor.memory=20G" --conf 
> "spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf 
> "spark.driver.memory=10G"{code}
>  
> Versions:
>  * pandas - 0.24.2
>  * pyarrow - 0.13.0
>  * Spark - 2.4.2
>  * Python - 2.7.16



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28344) fail the query if detect ambiguous self join

2019-07-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-28344:
---

 Summary: fail the query if detect ambiguous self join
 Key: SPARK-28344
 URL: https://issues.apache.org/jira/browse/SPARK-28344
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28269) ArrowStreamPandasSerializer get stack

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882641#comment-16882641
 ] 

Hyukjin Kwon commented on SPARK-28269:
--

It seems the Arrow stream batches are not created properly on the Python side. The 
NumPy array from {{np.random.randint(0, 100, size=(rows, cols_count))}} is backed by a 
memoryview, and Pandas leverages it as well IIRC. This could be the cause.
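
As a rough illustration of that memory sharing (a standalone sketch, not the reproducer itself), a single-dtype DataFrame built from a NumPy array can reference the array's buffer directly, while {{.copy()}} gives it its own:

{code:python}
import numpy as np
import pandas as pd

arr = np.random.randint(0, 100, size=(1000, 132))
df_view = pd.DataFrame(arr, columns=range(132))         # may share arr's buffer
df_copy = pd.DataFrame(arr, columns=range(132)).copy()  # owns a separate buffer

arr[0, 0] = -1
print(df_view.iloc[0, 0], df_copy.iloc[0, 0])  # the view reflects the change, the copy does not
{code}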

> ArrowStreamPandasSerializer get stack
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
> (pandas_df.toPickle()) resulted in a file of size 2.5 GB.
> I have some code that groups this data frame and applies a Pandas UDF:
>  
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
> from pyspark.sql.types import *
> from pyspark.sql import functions as F
> initial_list = range(4500)
> rdd = sc.parallelize(initial_list)
> rdd = rdd.map(lambda x: Row(val=x))
> initial_spark_df = spark.createDataFrame(rdd)
> cols_count = 132
> rows = 1000
> # --- Start Generating the big data frame---
> # Generating the schema
> schema = StructType([StructField(str(i), IntegerType()) for i in 
> range(cols_count)])
> @pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
> def random_pd_df_generator(df):
> import numpy as np
> import pandas as pd
> return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
> columns=range(cols_count))
> full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
> # --- End Generating the big data frame---
> # ---Start the bug reproduction---
> grouped_col = "col_0"
> @pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
> def very_simpl_udf(pdf):
> import pandas as pd
> ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
> return ret_val
> # In order to create a huge dataset, I've set all of the grouped_col value to 
> a single value, then, grouped it into a single dataset.
> # Here is where the program gets stuck
> full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
> assert False, "If we're, means that the issue wasn't reproduced"
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
> line, when reading a batch from the reader):
>  
> {code:java}
> for batch in reader:
>  yield [self.arrow_to_pandas(c) for c in  
> pa.Table.from_batches([batch]).itercolumns()]{code}
>  
> You can just run the first code snippet and it will reproduce the issue.
> Open a Pyspark shell with this configuration:
> {code:java}
> pyspark --conf "spark.python.worker.memory=3G" --conf 
> "spark.executor.memory=20G" --conf 
> "spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf 
> "spark.driver.memory=10G"{code}
>  
> Versions:
>  * pandas - 0.24.2
>  * pyarrow - 0.13.0
>  * Spark - 2.4.2
>  * Python - 2.7.16



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28269) ArrowStreamPandasSerializer get stack

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882639#comment-16882639
 ] 

Hyukjin Kwon commented on SPARK-28269:
--

A workaround for me was to call `copy()` on this line:

{code:java}
-return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))
+return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count)).copy()
{code}
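
For context, a sketch of the reproducer's UDF from the description with this workaround applied ({{schema}}, {{rows}} and {{cols_count}} are taken from the snippet quoted below):

{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

cols_count, rows = 132, 1000
schema = StructType([StructField(str(i), IntegerType()) for i in range(cols_count)])

@pandas_udf(returnType=schema, functionType=PandasUDFType.GROUPED_MAP)
def random_pd_df_generator(df):
    import numpy as np
    import pandas as pd
    # .copy() detaches the returned frame from the memoryview-backed NumPy array
    return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)),
                        columns=range(cols_count)).copy()
{code}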


> ArrowStreamPandasSerializer get stack
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
> (pandas_df.toPickle()) resulted in a file of size 2.5 GB.
> I have some code that groups this data frame and applies a Pandas UDF:
>  
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
> from pyspark.sql.types import *
> from pyspark.sql import functions as F
> initial_list = range(4500)
> rdd = sc.parallelize(initial_list)
> rdd = rdd.map(lambda x: Row(val=x))
> initial_spark_df = spark.createDataFrame(rdd)
> cols_count = 132
> rows = 1000
> # --- Start Generating the big data frame---
> # Generating the schema
> schema = StructType([StructField(str(i), IntegerType()) for i in 
> range(cols_count)])
> @pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
> def random_pd_df_generator(df):
> import numpy as np
> import pandas as pd
> return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
> columns=range(cols_count))
> full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
> # --- End Generating the big data frame---
> # ---Start the bug reproduction---
> grouped_col = "col_0"
> @pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
> def very_simpl_udf(pdf):
> import pandas as pd
> ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
> return ret_val
> # In order to create a huge dataset, I've set all of the grouped_col value to 
> a single value, then, grouped it into a single dataset.
> # Here is where the program gets stuck
> full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
> assert False, "If we're, means that the issue wasn't reproduced"
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
> line, when reading a batch from the reader):
>  
> {code:java}
> for batch in reader:
>  yield [self.arrow_to_pandas(c) for c in  
> pa.Table.from_batches([batch]).itercolumns()]{code}
>  
> You can just run the first code snippet and it will reproduce the issue.
> Open a Pyspark shell with this configuration:
> {code:java}
> pyspark --conf "spark.python.worker.memory=3G" --conf 
> "spark.executor.memory=20G" --conf 
> "spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf 
> "spark.driver.memory=10G"{code}
>  
> Versions:
>  * pandas - 0.24.2
>  * pyarrow - 0.13.0
>  * Spark - 2.4.2
>  * Python - 2.7.16



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28337) spark jars do not contain commons-jxpath jar, cause ClassNotFound exception

2019-07-10 Thread Wang Yanlin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882619#comment-16882619
 ] 

Wang Yanlin commented on SPARK-28337:
-

My pom configuration for shading commons-configuration and commons-jxpath:


{code:xml}
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <parent>
    <artifactId>yanzhi-app</artifactId>
    <groupId>com.alipay.yanzhi</groupId>
    <version>1.0-SNAPSHOT</version>
  </parent>
  <modelVersion>4.0.0</modelVersion>

  <artifactId>spark-app</artifactId>
  <packaging>jar</packaging>

  <!-- com.yanzhi_project -->

  <dependencies>
    <dependency>
      <groupId>commons-jxpath</groupId>
      <artifactId>commons-jxpath</artifactId>
      <version>1.3</version>
    </dependency>
    <dependency>
      <groupId>commons-configuration</groupId>
      <artifactId>commons-configuration</artifactId>
      <version>1.5</version>
      <scope>compile</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <configuration>
          <!-- false -->
          <artifactSet>
            <includes>
              <include>commons-jxpath:commons-jxpath</include>
              <include>commons-configuration:commons-configuration</include>
            </includes>
          </artifactSet>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
{code}


> spark jars do not contain commons-jxpath jar, cause ClassNotFound exception
> ---
>
> Key: SPARK-28337
> URL: https://issues.apache.org/jira/browse/SPARK-28337
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Wang Yanlin
>Priority: Major
> Attachments: exception_stack_trace.jpg, shde_in_pom.jpg
>
>
> When submitting a Spark application that uses XPathExpressionEngine, a ClassNotFound 
> exception is thrown, even though I shade the classes of 
> commons-jxpath:commons-jxpath into my jar.
> The main class of my jar is as follows:
> object StaticTestMain {
>     def main(args: Array[String]): Unit = {
>        println(s"yanzhi class loader info ${getClass.getClassLoader}")
>         val engine = new XPathExpressionEngine() // the exception happens on 
> this line
>         // some other codes.
>     }
>  }
> The configuration of the shade plugin is in the attached picture file 
> "shade_in_pom.jpg".
> The exception stack trace is in the attached picture file 
> "exception_stack_trace.jpg".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Fix Version/s: 2.4.4

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.4, 3.0.0
>
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374
> From Spark 1.6.3 through 2.4.3, the behavior is the same:
> {code}
> spark-sql> SELECT CAST('1999 08 01' AS DATE);
> 1999-01-01
> {code}
> Hive returns NULL.
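
A language-agnostic sketch of the fix idea (plain Python here, not the Catalyst code): after parsing the yyyy or yyyy-[m]m prefix, verify that the whole input was consumed and reject the string otherwise.

{code:python}
import re

def string_to_date(s):
    # The anchors force the pattern to consume the entire input, so a trailing
    # remainder such as ' 08 01' makes the parse fail instead of being ignored.
    m = re.match(r"^(\d{4})(?:-(\d{1,2}))?$", s.strip())
    if not m:
        return None
    year, month = int(m.group(1)), int(m.group(2) or 1)
    return (year, month, 1)

assert string_to_date("1999 08 01") is None      # rejected instead of 1999-01-01
assert string_to_date("1999-08") == (1999, 8, 1)
{code}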



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28269) ArrowStreamPandasSerializer get stack

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28269:
-
Description: 
I'm working with Pyspark version 2.4.3.

I have a big data frame:
 * ~15M rows
 * ~130 columns
 * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
(pandas_df.toPickle()) resulted in a file of size 2.5 GB.

I have some code that groups this data frame and applies a Pandas UDF:

 
{code:java}
from pyspark.sql import Row
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
from pyspark.sql.types import *
from pyspark.sql import functions as F

initial_list = range(4500)
rdd = sc.parallelize(initial_list)
rdd = rdd.map(lambda x: Row(val=x))
initial_spark_df = spark.createDataFrame(rdd)

cols_count = 132
rows = 1000

# --- Start Generating the big data frame---
# Generating the schema
schema = StructType([StructField(str(i), IntegerType()) for i in 
range(cols_count)])

@pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
def random_pd_df_generator(df):
import numpy as np
import pandas as pd
return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))


full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
# --- End Generating the big data frame---

# ---Start the bug reproduction---
grouped_col = "col_0"

@pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
def very_simpl_udf(pdf):
import pandas as pd
ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
return ret_val

# In order to create a huge dataset, I've set all of the grouped_col value to a 
single value, then, grouped it into a single dataset.
# Here is where the program gets stuck
full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
assert False, "If we're, means that the issue wasn't reproduced"

{code}
 

The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
line, when reading a batch from the reader):

 
{code:java}
for batch in reader:
 yield [self.arrow_to_pandas(c) for c in  
pa.Table.from_batches([batch]).itercolumns()]{code}
 

You can just run the first code snippet and it will reproduce the issue.

Open a Pyspark shell with this configuration:
{code:java}
pyspark --conf "spark.python.worker.memory=3G" --conf 
"spark.executor.memory=20G" --conf 
"spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf 
"spark.driver.memory=10G"{code}
 

Versions:
 * pandas - 0.24.2
 * pyarrow - 0.13.0
 * Spark - 2.4.2
 * Python - 2.7.16

  was:
I'm working with Pyspark version 2.4.3.

I have a big data frame:
 * ~15M rows
 * ~130 columns
 * ~2.5 GB - I've converted it to a Pandas data frame, then, pickling it 
(pandas_df.toPickle() ) resulted with a file of size 2.5GB.

I have some code that groups this data frame and applying a Pandas-UDF:

 
{code:java}
from pyspark.sql import Row
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
from pyspark.sql.types import *
from pyspark.sql import functions as F

initial_list = range(4500)
rdd = sc.parallelize(initial_list)
rdd = rdd.map(lambda x: Row(val=x))
initial_spark_df = spark.createDataFrame(rdd)

cols_count = 132
rows = 1000

# --- Start Generating the big data frame---
# Generating the schema
schema = StructType([StructField(str(i), IntegerType()) for i in 
range(cols_count)])

@pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
def random_pd_df_generator(df):
import numpy as np
import pandas as pd
return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
columns=range(cols_count))


full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
# --- End Generating the big data frame---

# ---Start the bug reproduction---
grouped_col = "col_0"

@pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
def very_simpl_udf(pdf):
import pandas as pd
ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
return ret_val

# In order to create a huge dataset, I've set all of the grouped_col value to a 
single value, then, grouped it into a single dataset.
# Here is where to program gets stuck
full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
assert False, "If we're, means that the issue wasn't reproduced"

{code}
 

The above code gets stacked on the ArrowStreamPandasSerializer: (on the first 
line when reading batch from the reader)

 
{code:java}
for batch in reader:
 yield [self.arrow_to_pandas(c) for c in  
pa.Table.from_batches([batch]).itercolumns()]{code}
 

 You can just run the first code 

[jira] [Assigned] (SPARK-28306) Once optimizer rule NormalizeFloatingNumbers is not idempotent

2019-07-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28306:
---

Assignee: Yesheng Ma

> Once optimizer rule NormalizeFloatingNumbers is not idempotent
> --
>
> Key: SPARK-28306
> URL: https://issues.apache.org/jira/browse/SPARK-28306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
>
> When the rule NormalizeFloatingNumbers is applied multiple times, it adds an 
> additional transform operator to the expression each time, which is not appropriate. To 
> fix this, we have to make the rule idempotent, i.e. it should yield the same logical plan 
> regardless of how many times it runs.
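
A generic sketch of the idempotency requirement (not the actual Catalyst rule): a rule that wraps expressions has to recognize an already-wrapped expression so that running it again changes nothing.

{code:python}
def normalize(expr):
    # represent a wrapped expression as ("normalize", inner)
    if isinstance(expr, tuple) and expr and expr[0] == "normalize":
        return expr                  # already normalized: leave it untouched
    return ("normalize", expr)       # wrap exactly once

plan = ("col", "x")
once = normalize(plan)
twice = normalize(once)
assert once == twice                 # idempotent: f(f(x)) == f(x)
{code}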



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28306) Once optimizer rule NormalizeFloatingNumbers is not idempotent

2019-07-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28306.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25080
[https://github.com/apache/spark/pull/25080]

> Once optimizer rule NormalizeFloatingNumbers is not idempotent
> --
>
> Key: SPARK-28306
> URL: https://issues.apache.org/jira/browse/SPARK-28306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> When the rule NormalizeFloatingNumbers is applied multiple times, it adds an 
> additional transform operator to the expression each time, which is not appropriate. To 
> fix this, we have to make the rule idempotent, i.e. it should yield the same logical plan 
> regardless of how many times it runs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28300) Kmeans is failing when we run parallely passing an RDD

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28300.
--
Resolution: Invalid

This looks like a question. Let's interact with the mailing list first before filing it 
as an issue. Check out [https://spark.apache.org/community.html].

Also, if possible, test it on at least 2.3.x or later; lower versions are EOL 
releases.

> Kmeans is failing when we run parallely passing an RDD
> --
>
> Key: SPARK-28300
> URL: https://issues.apache.org/jira/browse/SPARK-28300
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ramanjaneya Naidu Nalla
>Priority: Major
>
> Hi,
> I am facing an issue when we run the Spark KMeans algorithm in parallel by passing a 
> sample RDD.
> The KMeans run fails on the executor when we pass the cluster sample as an RDD 
> (RDD[linalg.Vector]) to the executors. It fails because the RDD[linalg.Vector] is 
> unavailable on the executor side.
> Can we pass an RDD to an executor to make KMeans run in parallel?
> Please suggest how to achieve running KMeans in parallel on 
> executors.
> Please find below the code snippet and the error in the logs.
> Regards,
> Raman.
> +Code snippet+
> Driver side  code ::
> val kmeansCluster = sc.parallelize(List.range(kStart, kEnd + 1)).map(k => {
>  val sharedContext = SharedClusteringData[linalg.Vector,KMeansModel](job, 
> spark, sampleId, Some(k),
>  ClusteringType.KMEANS.name() + "clustering processes for:" + k)
>  //val sharedContextLoadSamplesCount = 
> sharedContextLoadSample.clusterSample.get
>  //log.info(s"cluster sample count is 
> ${sharedContextLoadSamplesCount.count()}")
>  sharedContext.selectedFeatureIdx = 
> Some(loadSample.value.selectedFeatureIdx.get)
>  sharedContext.dropColIdx = Some(loadSample.value.dropColIdx.get)
>  sharedContext.dataset = loadSample.value.dataset
>  sharedContext.clusterSample= loadSample.value.clusterSample
>  println("In Driver program :::")
>  sharedContext.clusterSample.foreach(x=>println(x))
>  println("In Driver program END :::")
>  RunClustering.runKMean(sharedContext) match {
>  case Success(true) =>
>  log.info(s"${ClusteringType.KMEANS.name()} is completed for k =$k ")
>  case Success(false) =>
>  log.error(s"${ClusteringType.KMEANS.name()} is failed for k = $k")
>  case Failure(ex) =>
>  log.error(s"${ClusteringType.KMEANS.name} clustering failed for $k")
>  log.error(ex.getStackTrace.mkString("\n"))
>  }
>  (k, sharedContext.isSuccessful, sharedContext.message)
>  })
> +Executor side+ 
>  def buildCluster[S, M](k: Int, clusterSample: RDD[S], maxIteration: Int): 
> Try[M] =
> { Try(KMeans.train(kmeanSample, k, maxIteration).asInstanceOf[M]) }
> Logs::
>  
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89) 
> org.apache.spark.rdd.RDD.count(RDD.scala:1158) 
> com.mplatform.consumer.clustering.buildcluster.BuildKMeansCluster.getClustering(BuildKMeansCluster.scala:33)
>  
> com.mplatform.consumer.clustering.buildcluster.BuildCluster.run(BuildCluster.scala:14)
>  
> com.mplatform.consumer.clustering.clusteringprocessor.RunClustering$$anonfun$runKMean$1.apply$mcZ$sp(RunClustering.scala:14)
>  
> com.mplatform.consumer.clustering.clusteringprocessor.RunClustering$$anonfun$runKMean$1.apply(RunClustering.scala:11)
>  
> com.mplatform.consumer.clustering.clusteringprocessor.RunClustering$$anonfun$runKMean$1.apply(RunClustering.scala:11)
>  scala.util.Try$.apply(Try.scala:192) 
> com.mplatform.consumer.clustering.clusteringprocessor.RunClustering$.runKMean(RunClustering.scala:11)
>  
> com.mplatform.consumer.clustering.clusteringprocessor.ClusterProcessor$$anonfun$1.apply(ClusterProcessor.scala:81)
>  
> com.mplatform.consumer.clustering.clusteringprocessor.ClusterProcessor$$anonfun$1.apply(ClusterProcessor.scala:69)
>  scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) 
> scala.collection.AbstractIterator.to(Iterator.scala:1336) 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) 
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) 
> scala.collection.AbstractIterator.toArray(Iterator.scala:1336) 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) 

[jira] [Commented] (SPARK-28320) Spark job eventually fails after several "attempted to access non-existent accumulator" in DAGScheduler

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882613#comment-16882613
 ] 

Hyukjin Kwon commented on SPARK-28320:
--

Is it possible to provide a reproducer? It seems difficult to verify this without 
knowing how to reproduce it.

> Spark job eventually fails after several "attempted to access non-existent 
> accumulator" in DAGScheduler
> ---
>
> Key: SPARK-28320
> URL: https://issues.apache.org/jira/browse/SPARK-28320
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Martin Studer
>Priority: Major
>
> I'm running into an issue where a Spark 2.3.0 (Hortonworks HDP 2.6.5) job 
> eventually fails with
> {noformat}
> ERROR ApplicationMaster: User application exited with status 1
> INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User 
> application exited with status 1)
> INFO SparkContext: Invoking stop() from shutdown hook
> {noformat}
> after receiving several exception of the form
> {noformat}
> ERROR DAGScheduler: Failed to update accumulators for task 0
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 39052
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1130)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> {noformat}
> In addition to "attempted to access non-existent accumulator" I have also 
> noticed some (but much less) instances of "Attempted to access garbage 
> collected accumulator":
> {noformat}
> ERROR DAGScheduler: Failed to update accumulators for task 0
> java.lang.IllegalStateException: Attempted to access garbage collected 
> accumulator 38352
> at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
> at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1127)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {noformat}
> To provide some more context: This happens in a recursive algorithm 
> implemented in pyspark where I leverage data frame checkpointing to truncate 
> the lineage graph. Checkpointing is done asynchronously by invoking the count 
> action on a different thread when recursing (using Python thread pools).
> While "attempted to access garbage collected accumulator" seems to be an 
> unexpected (illegal state) exception, it's unclear to me whether "attempted 
> to access non-existent accumulator" is an expected exception in some 
> circumstances, specifically related to checkpointing.
> The issue looks somewhat related to 
> https://issues.apache.org/jira/browse/SPARK-22371 but that issue does not 
> mention "attempted to access non-existent accumulator".
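
A hedged sketch of the checkpointing pattern described above (hypothetical names, not the reporter's code): truncate the lineage with a checkpoint and materialize it with count() on a separate thread while the recursion continues.

{code:python}
from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)

def checkpoint_async(df):
    ck = df.checkpoint(eager=False)        # lazy checkpoint to truncate the lineage graph
    pool.apply_async(lambda: ck.count())   # count() on another thread materializes it
    return ck
{code}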



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Commented] (SPARK-28343) PostgreSQL test should change some default config

2019-07-10 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882611#comment-16882611
 ] 

Yuming Wang commented on SPARK-28343:
-

I'm working on it.

> PostgreSQL test should change some default config
> -
>
> Key: SPARK-28343
> URL: https://issues.apache.org/jira/browse/SPARK-28343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> set spark.sql.crossJoin.enabled=true;
> set spark.sql.parser.ansi.enabled=true;
> {noformat}
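
A small sketch of what that would mean for a test session (assuming a SparkSession named {{spark}}; not the actual test-harness change):

{code:python}
spark.sql("set spark.sql.crossJoin.enabled=true")
spark.sql("set spark.sql.parser.ansi.enabled=true")
# or, equivalently, through the runtime config
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.conf.set("spark.sql.parser.ansi.enabled", "true")
{code}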



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28327) Spark SQL can't support union with left query have queryOrganization

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882601#comment-16882601
 ] 

Hyukjin Kwon commented on SPARK-28327:
--

Currently, feature parity is being matched against PostgreSQL, and I am not sure whether 
this one complies with the ANSI standard.
Given the current context, this does not seem to be an issue.

> Spark SQL can't support union with left query  have queryOrganization
> -
>
> Key: SPARK-28327
> URL: https://issues.apache.org/jira/browse/SPARK-28327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL can't support SQL like 
> {code:java}
> SELECT A FROM TABLE_1 LIMIT 1
> UNION 
> SELECT A FROM TABLE_2 LIMIT 2{code}
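
A hedged workaround sketch (assuming a SparkSession named {{spark}} and the tables from the report): wrapping each LIMIT query in an aliased subquery lets the UNION parse, since the set operation then combines two plain relations.

{code:python}
spark.sql("""
SELECT * FROM (SELECT A FROM TABLE_1 LIMIT 1) t1
UNION
SELECT * FROM (SELECT A FROM TABLE_2 LIMIT 2) t2
""").show()
{code}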



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28327) Spark SQL can't support union with left query have queryOrganization

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28327.
--
Resolution: Won't Fix

> Spark SQL can't support union with left query  have queryOrganization
> -
>
> Key: SPARK-28327
> URL: https://issues.apache.org/jira/browse/SPARK-28327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL can't support SQL like 
> {code:java}
> SELECT A FROM TABLE_1 LIMIT 1
> UNION 
> SELECT A FROM TABLE_2 LIMIT 2{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28336) Tried running same code in local machine in IDE pycharm it running fine but issue arises when i setup all on EC2 my RDD has Json Value and convert it to data frame and

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28336.
--
Resolution: Invalid

This looks like a question. Let's interact with the mailing list first before filing it 
as an issue. See [https://spark.apache.org/community.html].

> Tried running same code in local machine in IDE pycharm it running fine but 
> issue arises when i setup all on EC2 my RDD has Json Value and convert it to 
> data frame and show dataframe by Show method it fails to show my data frame.
> -
>
> Key: SPARK-28336
> URL: https://issues.apache.org/jira/browse/SPARK-28336
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, DStreams, EC2, PySpark, Spark Submit
>Affects Versions: 2.4.3
> Environment: Using EC2 Ubuntu 18.04.2 LTS
> Spark version : Spark 2.4.3 built for Hadoop 2.7.3
> Kafka version : kafka_2.12-2.2.1
>Reporter: Aditya
>Priority: Minor
>  Labels: kafka
>
> I am a beginner to PySpark and I am creating a pilot project in Spark. I used the 
> PyCharm IDE for developing my project and it runs fine in my IDE. Let me explain my 
> project: I am producing JSON to a Kafka topic, consuming the topic in Spark, and 
> converting the RDD value (which is JSON) to a data frame using this method 
> (productInfo = sqlContext.read.json(rdd)); this works perfectly on my local machine. 
> After converting the RDD to a DataFrame, I display that DataFrame on my console using 
> the .show() method, and it works fine.
> But my problem arises when I set all this up (Kafka, Apache Spark) on EC2 (Ubuntu 
> 18.04.2 LTS) and try to execute it using spark-submit: the console stops when it 
> reaches my show() method and displays nothing, then starts again and stops at the 
> show() method. I can't figure out what the error is; nothing is shown in the console. I 
> also checked whether my data is arriving in the RDD, and it is.
> {color:#ff}My Code: {color}
> {code:java}
> # coding: utf-8 
> from pyspark import SparkContext
> from pyspark import SparkConf
> from pyspark.streaming import StreamingContext
> from pyspark.streaming.kafka import KafkaUtils
> from pyspark.sql import Row, DataFrame, SQLContext
> import pandas as pd
> def getSqlContextInstance(sparkContext):
> if ('sqlContextSingletonInstance' not in globals()):
> globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
> return globals()['sqlContextSingletonInstance']
> def process(time, rdd):
> print("= %s =" % str(time))
> try:
> #print("--Also cross check my data is present in rdd I 
> checked by printing ")
> #results = rdd.collect()
> #for result in results:
> #print(result)
> # Get the singleton instance of SparkSession
> sqlContext = getSqlContextInstance(rdd.context)
> productInfo = sqlContext.read.json(rdd)
> # problem comes here when i try to show it
> productInfo.show()
> except:
> pass
> if __name__ == '__main__':
> conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
> sc = SparkContext(conf = conf)
> sc.setLogLevel("WARN")
> sqlContext = SQLContext(sc)
> ssc = StreamingContext(sc,10)
> kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 
> 'spark-streaming', {'new_topic':1})
> lines = kafkaStream.map(lambda x: x[1])
> lines.foreachRDD(process)
> #lines.pprint()
> ssc.start()
> ssc.awaitTermination()
> {code}
>  
> {color:#ff}My console:{color}
> {code:java}
> ./spark-submit ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
>  19/07/10 11:13:15 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>  Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
>  19/07/10 11:13:15 INFO SparkContext: Running Spark version 2.4.3
>  19/07/10 11:13:15 INFO SparkContext: Submitted application: 
> ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
>  19/07/10 11:13:15 INFO SecurityManager: Changing view acls to: kafka
>  19/07/10 11:13:15 INFO SecurityManager: Changing modify acls to: kafka
>  19/07/10 11:13:15 INFO SecurityManager: Changing view acls groups to: 
>  19/07/10 11:13:15 INFO SecurityManager: Changing modify acls groups to: 
>  19/07/10 11:13:15 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(kafka); groups 
> with view permissions: Set(); users with modify permissions: Set(kafka); 
> groups with modify permissions: Set()
>  19/07/10 11:13:16 INFO Utils: Successfully started service 

[jira] [Updated] (SPARK-28336) Tried running same code in local machine in IDE pycharm it running fine but issue arises when i setup all on EC2 my RDD has Json Value and convert it to data frame and s

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28336:
-
Description: 
I am a beginner to PySpark and I am creating a pilot project in Spark. I used the 
PyCharm IDE for developing my project and it runs fine in my IDE. Let me explain my 
project: I am producing JSON to a Kafka topic, consuming the topic in Spark, and 
converting the RDD value (which is JSON) to a data frame using this method 
(productInfo = sqlContext.read.json(rdd)); this works perfectly on my local machine. 
After converting the RDD to a DataFrame, I display that DataFrame on my console using 
the .show() method, and it works fine.

But my problem arises when I set all this up (Kafka, Apache Spark) on EC2 (Ubuntu 
18.04.2 LTS) and try to execute it using spark-submit: the console stops when it reaches 
my show() method and displays nothing, then starts again and stops at the show() 
method. I can't figure out what the error is; nothing is shown in the console. I also 
checked whether my data is arriving in the RDD, and it is.

{color:#ff}My Code: {color}
{code:java}
# coding: utf-8 
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import Row, DataFrame, SQLContext
import pandas as pd

def getSqlContextInstance(sparkContext):
if ('sqlContextSingletonInstance' not in globals()):
globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
return globals()['sqlContextSingletonInstance']

def process(time, rdd):
print("= %s =" % str(time))

try:
#print("--Also cross check my data is present in rdd I checked 
by printing ")
#results = rdd.collect()
#for result in results:
#print(result)

# Get the singleton instance of SparkSession
sqlContext = getSqlContextInstance(rdd.context)
productInfo = sqlContext.read.json(rdd)

# problem comes here when i try to show it
productInfo.show()
except:
pass

if __name__ == '__main__':
conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(conf = conf)
sc.setLogLevel("WARN")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc,10)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 
'spark-streaming', {'new_topic':1})
lines = kafkaStream.map(lambda x: x[1])
lines.foreachRDD(process)
#lines.pprint()
ssc.start()
ssc.awaitTermination()
{code}
 

{color:#ff}My console:{color}
{code:java}
./spark-submit ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
 19/07/10 11:13:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 19/07/10 11:13:15 INFO SparkContext: Running Spark version 2.4.3
 19/07/10 11:13:15 INFO SparkContext: Submitted application: 
ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
 19/07/10 11:13:15 INFO SecurityManager: Changing view acls to: kafka
 19/07/10 11:13:15 INFO SecurityManager: Changing modify acls to: kafka
 19/07/10 11:13:15 INFO SecurityManager: Changing view acls groups to: 
 19/07/10 11:13:15 INFO SecurityManager: Changing modify acls groups to: 
 19/07/10 11:13:15 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(kafka); groups 
with view permissions: Set(); users with modify permissions: Set(kafka); groups 
with modify permissions: Set()
 19/07/10 11:13:16 INFO Utils: Successfully started service 'sparkDriver' on 
port 41655.
 19/07/10 11:13:16 INFO SparkEnv: Registering MapOutputTracker
 19/07/10 11:13:16 INFO SparkEnv: Registering BlockManagerMaster
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
up
 19/07/10 11:13:16 INFO DiskBlockManager: Created local directory at 
/tmp/blockmgr-33f848fe-88d7-4c8f-8440-8384e094c59c
 19/07/10 11:13:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
 19/07/10 11:13:16 INFO SparkEnv: Registering OutputCommitCoordinator
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4042. 
Attempting port 4043.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4043. 
Attempting port 4044.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4044. 
Attempting port 4045.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4045. 
Attempting port 4046.
 

[jira] [Updated] (SPARK-28336) Tried running same code in local machine in IDE pycharm it running fine but issue arises when i setup all on EC2 my RDD has Json Value and convert it to data frame and s

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28336:
-
Labels: kafka  (was: beginner kafka newbie)

> Tried running same code in local machine in IDE pycharm it running fine but 
> issue arises when i setup all on EC2 my RDD has Json Value and convert it to 
> data frame and show dataframe by Show method it fails to show my data frame.
> -
>
> Key: SPARK-28336
> URL: https://issues.apache.org/jira/browse/SPARK-28336
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, DStreams, EC2, PySpark, Spark Submit
>Affects Versions: 2.4.3
> Environment: Using EC2 Ubuntu 18.04.2 LTS
> Spark version : Spark 2.4.3 built for Hadoop 2.7.3
> Kafka version : kafka_2.12-2.2.1
>Reporter: Aditya
>Priority: Minor
>  Labels: kafka
>
> I am a beginner to PySpark and I am creating a pilot project in Spark. I used the 
> PyCharm IDE for developing my project and it runs fine in my IDE. Let me explain my 
> project: I am producing JSON to a Kafka topic, consuming the topic in Spark, and 
> converting the RDD value (which is JSON) to a data frame using this method 
> (productInfo = sqlContext.read.json(rdd)); this works perfectly on my local machine. 
> After converting the RDD to a DataFrame, I display that DataFrame on my console using 
> the .show() method, and it works fine.
> But my problem arises when I set all this up (Kafka, Apache Spark) on EC2 (Ubuntu 
> 18.04.2 LTS) and try to execute it using spark-submit: the console stops when it 
> reaches my show() method and displays nothing, then starts again and stops at the 
> show() method. I can't figure out what the error is; nothing is shown in the console. I 
> also checked whether my data is arriving in the RDD, and it is.
> {color:#ff}My Code: {color}
> {code:java}
> # coding: utf-8 
> from pyspark import SparkContext
> from pyspark import SparkConf
> from pyspark.streaming import StreamingContext
> from pyspark.streaming.kafka import KafkaUtils
> from pyspark.sql import Row, DataFrame, SQLContext
> import pandas as pd
> def getSqlContextInstance(sparkContext):
> if ('sqlContextSingletonInstance' not in globals()):
> globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
> return globals()['sqlContextSingletonInstance']
> def process(time, rdd):
> print("= %s =" % str(time))
> try:
> #print("--Also cross check my data is present in rdd I 
> checked by printing ")
> #results = rdd.collect()
> #for result in results:
> #print(result)
> # Get the singleton instance of SparkSession
> sqlContext = getSqlContextInstance(rdd.context)
> productInfo = sqlContext.read.json(rdd)
> # problem comes here when i try to show it
> productInfo.show()
> except:
> pass
> if __name__ == '__main__':
> conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
> sc = SparkContext(conf = conf)
> sc.setLogLevel("WARN")
> sqlContext = SQLContext(sc)
> ssc = StreamingContext(sc,10)
> kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 
> 'spark-streaming', {'new_topic':1})
> lines = kafkaStream.map(lambda x: x[1])
> lines.foreachRDD(process)
> #lines.pprint()
> ssc.start()
> ssc.awaitTermination()
> {code}
>  
> {color:#ff}My console:{color}
> {code:java}
> ./spark-submit ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
>  19/07/10 11:13:15 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>  Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
>  19/07/10 11:13:15 INFO SparkContext: Running Spark version 2.4.3
>  19/07/10 11:13:15 INFO SparkContext: Submitted application: 
> ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
>  19/07/10 11:13:15 INFO SecurityManager: Changing view acls to: kafka
>  19/07/10 11:13:15 INFO SecurityManager: Changing modify acls to: kafka
>  19/07/10 11:13:15 INFO SecurityManager: Changing view acls groups to: 
>  19/07/10 11:13:15 INFO SecurityManager: Changing modify acls groups to: 
>  19/07/10 11:13:15 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(kafka); groups 
> with view permissions: Set(); users with modify permissions: Set(kafka); 
> groups with modify permissions: Set()
>  19/07/10 11:13:16 INFO Utils: Successfully started service 'sparkDriver' on 
> port 41655.
>  19/07/10 11:13:16 INFO SparkEnv: Registering MapOutputTracker
>  19/07/10 11:13:16 INFO 

[jira] [Updated] (SPARK-28336) Tried running same code in local machine in IDE pycharm it running fine but issue arises when i setup all on EC2 my RDD has Json Value and convert it to data frame and s

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28336:
-
Description: 
I am a beginner to pyspark and I am creating a pilot project in spark i used 
pycharm IDE for developing my project and it runs fine on my IDE Let me explain 
my project I am producing JSON in Kafka topic and consuming topic in spark and 
converting RDD VALUE(which is i JSON) converting to data frame using this 
method (productInfo = sqlContext.read.json(rdd)) and working perfectly on my 
local machine after converting RDD to DataFrame I am displaying that DataFrame 
to my console using .Show() method and working fine.

But my problem arises when I setup all this(Kafka,apache-spark) in EC2(Ubuntu 
18.04.2 LTS) and tried to execute using spark-submit console stop when it 
reached my show() method and display nothing again starts and stops at show() 
method I can't figure out what is error not showing any error in console and 
also check if my data is coming in RDD or not it is in RDD

{color:#ff}My Code: {color}

{code}
# coding: utf-8
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import Row, DataFrame, SQLContext
import pandas as pd


def getSqlContextInstance(sparkContext):
    # Lazily create a singleton SQLContext that is reused across batches.
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']


def process(time, rdd):
    print("= %s =" % str(time))
    try:
        # I also cross-checked that my data is present in the RDD by printing it:
        # for result in rdd.collect():
        #     print(result)

        # Get the singleton instance of SQLContext
        sqlContext = getSqlContextInstance(rdd.context)
        productInfo = sqlContext.read.json(rdd)

        # The problem comes here when I try to show it
        productInfo.show()
    except:
        pass


if __name__ == '__main__':
    conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("WARN")
    sqlContext = SQLContext(sc)
    ssc = StreamingContext(sc, 10)
    kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181',
                                          'spark-streaming', {'new_topic': 1})
    lines = kafkaStream.map(lambda x: x[1])
    lines.foreachRDD(process)
    # lines.pprint()
    ssc.start()
    ssc.awaitTermination()
{code}
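
One thing worth trying here (a debugging sketch, not a confirmed fix) is to skip 
empty batches and print the real exception instead of swallowing it with the 
bare except, so that any error raised on EC2 becomes visible. This reuses the 
getSqlContextInstance() helper from the code above:

{code}
def process(time, rdd):
    print("= %s =" % str(time))
    if rdd.isEmpty():
        # Nothing arrived from Kafka in this batch interval.
        print("empty batch, skipping")
        return
    try:
        sqlContext = getSqlContextInstance(rdd.context)
        productInfo = sqlContext.read.json(rdd)
        productInfo.show()
    except Exception:
        # Surface the real error instead of silently passing.
        import traceback
        traceback.print_exc()
{code}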
 

{color:#ff}My console:{color}

./spark-submit ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
 19/07/10 11:13:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 19/07/10 11:13:15 INFO SparkContext: Running Spark version 2.4.3
 19/07/10 11:13:15 INFO SparkContext: Submitted application: 
ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
 19/07/10 11:13:15 INFO SecurityManager: Changing view acls to: kafka
 19/07/10 11:13:15 INFO SecurityManager: Changing modify acls to: kafka
 19/07/10 11:13:15 INFO SecurityManager: Changing view acls groups to: 
 19/07/10 11:13:15 INFO SecurityManager: Changing modify acls groups to: 
 19/07/10 11:13:15 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(kafka); groups 
with view permissions: Set(); users with modify permissions: Set(kafka); groups 
with modify permissions: Set()
 19/07/10 11:13:16 INFO Utils: Successfully started service 'sparkDriver' on 
port 41655.
 19/07/10 11:13:16 INFO SparkEnv: Registering MapOutputTracker
 19/07/10 11:13:16 INFO SparkEnv: Registering BlockManagerMaster
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
up
 19/07/10 11:13:16 INFO DiskBlockManager: Created local directory at 
/tmp/blockmgr-33f848fe-88d7-4c8f-8440-8384e094c59c
 19/07/10 11:13:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
 19/07/10 11:13:16 INFO SparkEnv: Registering OutputCommitCoordinator
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4042. 
Attempting port 4043.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4043. 
Attempting port 4044.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4044. 
Attempting port 4045.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4045. 
Attempting port 4046.
 19/07/10 

[jira] [Created] (SPARK-28343) PostgreSQL test should change some default config

2019-07-10 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28343:
---

 Summary: PostgreSQL test should change some default config
 Key: SPARK-28343
 URL: https://issues.apache.org/jira/browse/SPARK-28343
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang



{noformat}
set spark.sql.crossJoin.enabled=true;
set spark.sql.parser.ansi.enabled=true;
{noformat}
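
Equivalently, these non-default settings can be flipped on a running session for 
local experimentation (a sketch assuming a SparkSession named {{spark}}; the 
config keys are the ones listed above):

{code:python}
# Non-default settings the PostgreSQL tests are expected to run with.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.conf.set("spark.sql.parser.ansi.enabled", "true")
{code}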




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28342) Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28342.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/25105

> Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests
> -
>
> Key: SPARK-28342
> URL: https://issues.apache.org/jira/browse/SPARK-28342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.0.0
>
>
> See [https://github.com/apache/spark/pull/25086#discussion_r302208451]
> We should replace REL_12_BETA1 to REL_12_BETA2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28342) Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28342:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests
> -
>
> Key: SPARK-28342
> URL: https://issues.apache.org/jira/browse/SPARK-28342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> See [https://github.com/apache/spark/pull/25086#discussion_r302208451]
> We should replace REL_12_BETA1 to REL_12_BETA2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28342) Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28342:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests
> -
>
> Key: SPARK-28342
> URL: https://issues.apache.org/jira/browse/SPARK-28342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> See [https://github.com/apache/spark/pull/25086#discussion_r302208451]
> We should replace REL_12_BETA1 to REL_12_BETA2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28342) Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL tests

2019-07-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28342:


 Summary: Replace REL_12_BETA1 to REL_12_BETA2 in PostgresSQL SQL 
tests
 Key: SPARK-28342
 URL: https://issues.apache.org/jira/browse/SPARK-28342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


See [https://github.com/apache/spark/pull/25086#discussion_r302208451]

We should replace REL_12_BETA1 to REL_12_BETA2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28341) remove session catalog config

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28341:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove session catalog config
> -
>
> Key: SPARK-28341
> URL: https://issues.apache.org/jira/browse/SPARK-28341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28341) remove session catalog config

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28341:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove session catalog config
> -
>
> Key: SPARK-28341
> URL: https://issues.apache.org/jira/browse/SPARK-28341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28341) remove session catalog config

2019-07-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-28341:
---

 Summary: remove session catalog config
 Key: SPARK-28341
 URL: https://issues.apache.org/jira/browse/SPARK-28341
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28327) Spark SQL can't support union with left query have queryOrganization

2019-07-10 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882581#comment-16882581
 ] 

angerszhu commented on SPARK-28327:
---

[~yumwang] Thank you.

It seems that current Spark SQL's SQL standard leans more toward PostgreSQL?

Adding parentheses does indeed help in this case. Our analysts used Hive before, 
so they are sometimes confused.
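
For example, the query from the issue can be written like this (a sketch run 
through spark.sql, assuming a SparkSession named {{spark}} and the tables from 
the issue):

{code:python}
# Parentheses scope each LIMIT to its own SELECT before the UNION is applied.
result = spark.sql("""
    (SELECT A FROM TABLE_1 LIMIT 1)
    UNION
    (SELECT A FROM TABLE_2 LIMIT 2)
""")
result.show()
{code}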

> Spark SQL can't support union with left query  have queryOrganization
> -
>
> Key: SPARK-28327
> URL: https://issues.apache.org/jira/browse/SPARK-28327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL can't support SQL like 
> {code:java}
> SELECT A FROM TABLE_1 LIMIT 1
> UNION 
> SELECT A FROM TABLE_2 LIMIT 2{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28339) Rename Spark SQL adaptive execution configuration name

2019-07-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28339.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25102
[https://github.com/apache/spark/pull/25102]

> Rename Spark SQL adaptive execution configuration name
> --
>
> Key: SPARK-28339
> URL: https://issues.apache.org/jira/browse/SPARK-28339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Rename spark.sql.runtime.reoptimization.enabled to spark.sql.adaptive.enabled 
> as the configuration name for adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28339) Rename Spark SQL adaptive execution configuration name

2019-07-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28339:
---

Assignee: Carson Wang

> Rename Spark SQL adaptive execution configuration name
> --
>
> Key: SPARK-28339
> URL: https://issues.apache.org/jira/browse/SPARK-28339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Minor
>
> Rename spark.sql.runtime.reoptimization.enabled to spark.sql.adaptive.enabled 
> as the configuration name for adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28272) Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882568#comment-16882568
 ] 

Hyukjin Kwon commented on SPARK-28272:
--

Argh, sorry. It's blocked by SPARK-27988.

> Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base
> 
>
> Key: SPARK-28272
> URL: https://issues.apache.org/jira/browse/SPARK-28272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28015.
---
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/25097

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374
> Since Spark 1.6.3 ~ 2.4.3, the behavior is the same.
> {code}
> spark-sql> SELECT CAST('1999 08 01' AS DATE);
> 1999-01-01
> {code}
> Hive returns NULL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27919) DataSourceV2: Add v2 session catalog

2019-07-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27919.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24768
[https://github.com/apache/spark/pull/24768]

> DataSourceV2: Add v2 session catalog
> 
>
> Key: SPARK-27919
> URL: https://issues.apache.org/jira/browse/SPARK-27919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> When no default catalog is set, the session catalog (v1) is responsible for 
> table identifiers with no catalog part. When CTAS creates a table with a v2 
> provider, a v2 catalog is required and the default catalog is used. But this 
> may cause Spark to create a table in a catalog that it cannot use to look up 
> the table.
> In this case, a v2 catalog that delegates to the session catalog should be 
> used instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28270:


Assignee: Hyukjin Kwon

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
> 
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27770



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28270.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25069
[https://github.com/apache/spark/pull/25069]

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
> 
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> see SPARK-27770



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27919) DataSourceV2: Add v2 session catalog

2019-07-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27919:
---

Assignee: Ryan Blue

> DataSourceV2: Add v2 session catalog
> 
>
> Key: SPARK-27919
> URL: https://issues.apache.org/jira/browse/SPARK-27919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> When no default catalog is set, the session catalog (v1) is responsible for 
> table identifiers with no catalog part. When CTAS creates a table with a v2 
> provider, a v2 catalog is required and the default catalog is used. But this 
> may cause Spark to create a table in a catalog that it cannot use to look up 
> the table.
> In this case, a v2 catalog that delegates to the session catalog should be 
> used instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28281) Convert and port 'having.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28281:


Assignee: Huaxin Gao  (was: Hyukjin Kwon)

> Convert and port 'having.sql' into UDF test base
> 
>
> Key: SPARK-28281
> URL: https://issues.apache.org/jira/browse/SPARK-28281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28281) Convert and port 'having.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28281:


Assignee: Hyukjin Kwon

> Convert and port 'having.sql' into UDF test base
> 
>
> Key: SPARK-28281
> URL: https://issues.apache.org/jira/browse/SPARK-28281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28107) Interval type conversion syntax support

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28107.
---
   Resolution: Fixed
 Assignee: Zhu, Lipeng
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/25000

> Interval type conversion syntax support
> ---
>
> Key: SPARK-28107
> URL: https://issues.apache.org/jira/browse/SPARK-28107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Major
> Fix For: 3.0.0
>
>
> According to the 03 ANSI SQL standard, for interval type conversion Spark SQL 
> currently only supports:
>  * Interval year to month
>  * Interval day to second
>  * Interval hour to second
> There are some other forms that are supported in both PostgreSQL and the 03 
> ANSI SQL standard:
>  * Interval day to hour
>  * Interval day to minute
>  * Interval hour to minute
>  * Interval minute to second



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28281) Convert and port 'having.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28281.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25093
[https://github.com/apache/spark/pull/25093]

> Convert and port 'having.sql' into UDF test base
> 
>
> Key: SPARK-28281
> URL: https://issues.apache.org/jira/browse/SPARK-28281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28285:


Assignee: Apache Spark

> Convert and port 'outer-join.sql' into UDF test base
> 
>
> Key: SPARK-28285
> URL: https://issues.apache.org/jira/browse/SPARK-28285
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28271.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25086
[https://github.com/apache/spark/pull/25086]

> Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
> 
>
> Key: SPARK-28271
> URL: https://issues.apache.org/jira/browse/SPARK-28271
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> see SPARK-27883



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28271:


Assignee: Terry Kim

> Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
> 
>
> Key: SPARK-28271
> URL: https://issues.apache.org/jira/browse/SPARK-28271
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Terry Kim
>Priority: Major
>
> see SPARK-27883



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28275) Convert and port 'count.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28275.
--
   Resolution: Fixed
 Assignee: Vinod KC
Fix Version/s: 3.0.0

Fixed at [https://github.com/apache/spark/pull/25089]

> Convert and port 'count.sql' into UDF test base
> ---
>
> Key: SPARK-28275
> URL: https://issues.apache.org/jira/browse/SPARK-28275
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Vinod KC
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28323) PythonUDF should be able to use in join condition

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28323.
--
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 3.0.0

Fixed at [https://github.com/apache/spark/pull/25091]

> PythonUDF should be able to use in join condition
> -
>
> Key: SPARK-28323
> URL: https://issues.apache.org/jira/browse/SPARK-28323
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> There is a bug in {{ExtractPythonUDFs}} that produces wrong result 
> attributes. It causes a failure when using PythonUDFs among multiple child 
> plans, e.g., join. An example is using PythonUDFs in join condition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27922) Convert and port 'natural-join.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27922.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25088
[https://github.com/apache/spark/pull/25088]

> Convert and port 'natural-join.sql' into UDF test base
> --
>
> Key: SPARK-27922
> URL: https://issues.apache.org/jira/browse/SPARK-27922
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Manu Zhang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27922) Convert and port 'natural-join.sql' into UDF test base

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27922:


Assignee: Manu Zhang

> Convert and port 'natural-join.sql' into UDF test base
> --
>
> Key: SPARK-27922
> URL: https://issues.apache.org/jira/browse/SPARK-27922
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Manu Zhang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28234) Spark Resources - add python support to get resources

2019-07-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28234.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25087
[https://github.com/apache/spark/pull/25087]

> Spark Resources - add python support to get resources
> -
>
> Key: SPARK-28234
> URL: https://issues.apache.org/jira/browse/SPARK-28234
> Project: Spark
>  Issue Type: Story
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Add the equivalent python api for sc.resources and TaskContext.resources



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2019-07-10 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882532#comment-16882532
 ] 

Josh Rosen edited comment on SPARK-27991 at 7/11/19 12:28 AM:
--

I've tried to come up with a standalone reproduction of this issue, but so far 
I've been unable to find one that triggers this error. I've tried creating jobs 
which run 1+ mappers shuffling tiny blocks to a single reducer, resulting 
in thousands of requests in flight, but this has failed to trigger the error 
posted above.

However, I _did_ manage to get a more complete backtrace from a different 
internal workload:
{code:java}
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
16777216 byte(s) of direct memory (used: 7918845952, max: 7923040256)
at 
io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
at 
io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
at 
io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
at 
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
at 
io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more{code}
Something that jumps out to me is the 
{{DefaultMaxMessagesRecvByteBufAllocator}} (and the 
{{AdaptiveRecvByteBufAllocator}} in SPARK-24989): maybe there's something about 
these failing workloads which is leading to significant space wasting in 
receive buffers, causing tiny blocks to experience huge bloat in space 
requirements?


was (Author: joshrosen):
I've tried to come up with a standalone reproduction of this issue, but so far 
I've been unable to find one that triggers this error. I've tried creating jobs 
which run 1+ mappers shuffling tiny blocks to a single reducer, resulting 
in thousands of requests in flight, but this has failed to trigger the error 
posted above.

However, I _did_ manage to get a more complete backtrace from a different 
internal workload:
{code:java}
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
16777216 byte(s) of direct memory (used: 7918845952, max: 7923040256)
at 
io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
at 
io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
at 
io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
at 
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
at 
io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at 

[jira] [Commented] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2019-07-10 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882532#comment-16882532
 ] 

Josh Rosen commented on SPARK-27991:


I've tried to come up with a standalone reproduction of this issue, but so far 
I've been unable to find one that triggers this error. I've tried creating jobs 
which run 1+ mappers shuffling tiny blocks to a single reducer, resulting 
in thousands of requests in flight, but this has failed to trigger the error 
posted above.

However, I _did_ manage to get a more complete backtrace from a different 
internal workload:
{code:java}
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
16777216 byte(s) of direct memory (used: 7918845952, max: 7923040256)
at 
io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
at 
io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
at 
io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
at 
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
at 
io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more{code}
Something that jumps out to me is the 
{{DefaultMaxMessagesRecvByteBufAllocator}} (and the 
{{AdaptiveRecvByteBufAllocator}} SPARK-24989): maybe there's something about 
these failing workloads which is leading to significant space wasting in 
receive buffers, causing tiny blocks to experience huge bloat in space 
requirements?
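
As a possible stop-gap while the accounting issue is investigated, the number of 
simultaneous fetches can also be capped through the existing reducer-side 
settings (a sketch with purely illustrative values; this does not address the 
underlying fixed-overhead problem):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tiny-block-shuffle-mitigation")
    # Upper bound on the data size of in-flight fetches (default 48m).
    .config("spark.reducer.maxSizeInFlight", "24m")
    # Cap the number of simultaneous fetch requests.
    .config("spark.reducer.maxReqsInFlight", "256")
    # Cap blocks fetched concurrently from a single executor address.
    .config("spark.reducer.maxBlocksInFlightPerAddress", "256")
    .getOrCreate()
)
{code}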

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> 

[jira] [Commented] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882524#comment-16882524
 ] 

Yuming Wang commented on SPARK-28015:
-

Thank you [~dongjoon]

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374
> Since Spark 1.6.3 ~ 2.4.3, the behavior is the same.
> {code}
> spark-sql> SELECT CAST('1999 08 01' AS DATE);
> 1999-01-01
> {code}
> Hive returns NULL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException"

2019-07-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-28340:
--

 Summary: Noisy exceptions when tasks are killed: 
"DiskBlockObjectWriter: Uncaught exception while reverting partial writes to 
file: java.nio.channels.ClosedByInterruptException"
 Key: SPARK-28340
 URL: https://issues.apache.org/jira/browse/SPARK-28340
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Josh Rosen


If a Spark task is killed while writing blocks to disk (due to intentional job 
kills, automated killing of redundant speculative tasks, etc) then Spark may 
log exceptions like
{code:java}
19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while 
reverting partial writes to file /
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
at 
org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
at 
org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled task 
can result in hundreds of these stacktraces being logged.

Here are some StackOverflow questions asking about this:
 * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
 * 
[https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
 * 
[https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
 * 
[https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
 

Can we prevent this exception from occurring? If not, can we treat this 
"expected exception" in a special manner to avoid log spam? My concern is that 
the presence of large numbers of spurious exceptions is confusing to users when 
they are inspecting Spark logs to diagnose other issues.
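
If the exception itself cannot be avoided, one stop-gap for the log spam (not a 
real fix, and it would also hide genuine DiskBlockObjectWriter errors) would be 
to raise the threshold for that single logger in log4j.properties:

{code}
# Stop-gap only: this silences all ERROR output from DiskBlockObjectWriter,
# including real failures, not just the ClosedByInterruptException noise.
log4j.logger.org.apache.spark.storage.DiskBlockObjectWriter=FATAL
{code}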



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28338) spark.read.format("csv") treat empty string as null if csv file don't have quotes in data

2019-07-10 Thread Jayadevan M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayadevan M updated SPARK-28338:

Description: 
The csv input file

+cat sample.csv+ 
 Name,Lastname,Age
 abc,,32
 pqr,xxx,30

 

+spark-shell+

spark.read.format("csv").option("header", 
"true").load("/media/ub_share/projects/*.csv").head(3)
 res14: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])

 

scala> spark.read.format("csv").option("header", "true").option("nullValue", 
"?").load("/media/ub_share/projects/*.csv").head(3)
 res15: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])

 

The empty string gets converted to null. It works fine if the csv file has 
quotes around the column values.

  was:
The csv input file

cat sample.csv 
Name,Lastname,Age
abc,,32
pqr,xxx,30

 

spark-shell

spark.read.format("csv").option("header", 
"true").load("/media/ub_share/projects/*.csv").head(3)
res14: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])

 

scala> spark.read.format("csv").option("header", "true").option("nullValue", 
"?").load("/media/ub_share/projects/*.csv").head(3)
res15: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])


> spark.read.format("csv") treat empty string as null if csv file don't have 
> quotes in data
> -
>
> Key: SPARK-28338
> URL: https://issues.apache.org/jira/browse/SPARK-28338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jayadevan M
>Priority: Major
>
> The csv input file
> +cat sample.csv+ 
>  Name,Lastname,Age
>  abc,,32
>  pqr,xxx,30
>  
> +spark-shell+
> spark.read.format("csv").option("header", 
> "true").load("/media/ub_share/projects/*.csv").head(3)
>  res14: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])
>  
> scala> spark.read.format("csv").option("header", "true").option("nullValue", 
> "?").load("/media/ub_share/projects/*.csv").head(3)
>  res15: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])
>  
> The empty string gets converted to null. It works fine if the csv file has 
> quotes around the column values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28338) spark.read.format("csv") treat empty string as null if csv file don't have quotes in data

2019-07-10 Thread Jayadevan M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayadevan M updated SPARK-28338:

Summary: spark.read.format("csv") treat empty string as null if csv file 
don't have quotes in data  (was: spark.read.format("csv") treat empty string as 
null if csv file don't quotes in data)

> spark.read.format("csv") treat empty string as null if csv file don't have 
> quotes in data
> -
>
> Key: SPARK-28338
> URL: https://issues.apache.org/jira/browse/SPARK-28338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jayadevan M
>Priority: Major
>
> The csv input file
> cat sample.csv 
> Name,Lastname,Age
> abc,,32
> pqr,xxx,30
>  
> spark-shell
> spark.read.format("csv").option("header", 
> "true").load("/media/ub_share/projects/*.csv").head(3)
> res14: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])
>  
> scala> spark.read.format("csv").option("header", "true").option("nullValue", 
> "?").load("/media/ub_share/projects/*.csv").head(3)
> res15: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28339) Rename Spark SQL adaptive execution configuration name

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28339:


Assignee: (was: Apache Spark)

> Rename Spark SQL adaptive execution configuration name
> --
>
> Key: SPARK-28339
> URL: https://issues.apache.org/jira/browse/SPARK-28339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Carson Wang
>Priority: Minor
>
> Rename spark.sql.runtime.reoptimization.enabled to spark.sql.adaptive.enabled 
> as the configuration name for adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28339) Rename Spark SQL adaptive execution configuration name

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28339:


Assignee: Apache Spark

> Rename Spark SQL adaptive execution configuration name
> --
>
> Key: SPARK-28339
> URL: https://issues.apache.org/jira/browse/SPARK-28339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Carson Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Rename spark.sql.runtime.reoptimization.enabled to spark.sql.adaptive.enabled 
> as the configuration name for adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28339) Rename Spark SQL adaptive execution configuration name

2019-07-10 Thread Carson Wang (JIRA)
Carson Wang created SPARK-28339:
---

 Summary: Rename Spark SQL adaptive execution configuration name
 Key: SPARK-28339
 URL: https://issues.apache.org/jira/browse/SPARK-28339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Carson Wang


Rename spark.sql.runtime.reoptimization.enabled to spark.sql.adaptive.enabled 
as the configuration name for adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Labels: correctness  (was: )

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374
> Since Spark 1.6.3 ~ 2.4.3, the behavior is the same.
> {code}
> spark-sql> SELECT CAST('1999 08 01' AS DATE);
> 1999-01-01
> {code}
> Hive returns NULL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Description: 
Invalid date formats should throw an exception:
{code:sql}
SELECT date '1999 08 01'
1999-01-01
{code}

Supported date formats:
https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374

Since Spark 1.6.3 ~ 2.4.3, the behavior is the same.
{code}
spark-sql> SELECT CAST('1999 08 01' AS DATE);
1999-01-01
{code}

Hive returns NULL.

  was:
Invalid date formats should throw an exception:
{code:sql}
SELECT date '1999 08 01'
1999-01-01
{code}

Supported date formats:
https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374


> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374
> Since Spark 1.6.3 ~ 2.4.3, the behavior is the same.
> {code}
> spark-sql> SELECT CAST('1999 08 01' AS DATE);
> 1999-01-01
> {code}
> Hive returns NULL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-10 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882381#comment-16882381
 ] 

Ruslan Dautkhanov commented on SPARK-28266:
---

This issue happens when the `spark.sql.sources.provider` table property is NOT 
present and the `path` serde property is present -

Spark duplicates records in this case.
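
To check whether a given table is in this state, the serde/storage properties 
can be inspected from Spark itself (a sketch assuming a SparkSession named 
{{spark}} and the table from the reproducer below):

{code:python}
# Look for a 'path' entry under Storage Properties and for the absence of
# the 'spark.sql.sources.provider' table property.
spark.sql("DESCRIBE FORMATTED ruslan_test.test55").show(100, truncate=False)
{code}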

 

> data correctness issue: data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run the following in Hive, for example -)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28277) Convert and port 'except.sql' into UDF test base

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28277:


Assignee: Apache Spark

> Convert and port 'except.sql' into UDF test base
> 
>
> Key: SPARK-28277
> URL: https://issues.apache.org/jira/browse/SPARK-28277
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28277) Convert and port 'except.sql' into UDF test base

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28277:


Assignee: (was: Apache Spark)

> Convert and port 'except.sql' into UDF test base
> 
>
> Key: SPARK-28277
> URL: https://issues.apache.org/jira/browse/SPARK-28277
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28199) Move Trigger implementations to Triggers.scala and avoid exposing these to the end users

2019-07-10 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-28199:
-
Description: 
Even though ProcessingTime has been deprecated since 2.2.0, it's still used in the 
Spark codebase, and the alternative Spark proposes actually uses deprecated 
methods itself, which feels circular - we would never be able to remove the usage.

In fact, ProcessingTime is deprecated because we want to expose only 
Trigger.xxx instead of the actual implementations, and I think we missed some 
other implementations as well.

This issue aims to move all Trigger implementations to Triggers.scala and hide 
them from end users.

  was:
Even though ProcessingTime has been deprecated since 2.2.0, it's still used in the 
Spark codebase, and the alternative Spark proposes actually uses deprecated 
methods itself, which feels circular - we would never be able to remove the usage.

This issue targets removing the usage of ProcessingTime in the Spark codebase by 
adding a new class to replace ProcessingTime.


> Move Trigger implementations to Triggers.scala and avoid exposing these to 
> the end users
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>  Labels: release-notes
>
> Even though ProcessingTime has been deprecated since 2.2.0, it's still used in 
> the Spark codebase, and the alternative Spark proposes actually uses deprecated 
> methods itself, which feels circular - we would never be able to remove the usage.
> In fact, ProcessingTime is deprecated because we want to expose only 
> Trigger.xxx instead of the actual implementations, and I think we missed some 
> other implementations as well.
> This issue aims to move all Trigger implementations to Triggers.scala and hide 
> them from end users.
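
For context, the user-facing API this keeps is the factory-style trigger (Trigger.ProcessingTime in Scala, the processingTime keyword in PySpark) rather than the implementation classes. A minimal sketch, assuming a local session; the rate source and console sink are chosen only for illustration:

{code:python}
# Minimal sketch of the factory-style trigger API end users are expected to
# use; the internal Trigger implementations stay hidden. Source/sink are
# illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = spark.readStream.format("rate").load()
query = (stream.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")  # instead of the deprecated ProcessingTime class
         .start())
query.stop()
{code}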



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28199) Move Trigger implementations to Triggers.scala and avoid exposing these to the end users

2019-07-10 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-28199:
-
Summary: Move Trigger implementations to Triggers.scala and avoid exposing 
these to the end users  (was: Remove usage of ProcessingTime in Spark codebase)

> Move Trigger implementations to Triggers.scala and avoid exposing these to 
> the end users
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>  Labels: release-notes
>
> Even though ProcessingTime has been deprecated since 2.2.0, it's still used in 
> the Spark codebase, and the alternative Spark proposes actually uses deprecated 
> methods itself, which feels circular - we would never be able to remove the usage.
> This issue targets removing the usage of ProcessingTime in the Spark codebase 
> by adding a new class to replace ProcessingTime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28324) The LOG function using 10 as the base, but Spark using E

2019-07-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882288#comment-16882288
 ] 

Sean Owen commented on SPARK-28324:
---

I don't think we should change this as it will break code and there isn't a 
'standard' here AFAICT. As you say, Hive treats this as log base e, as does 
Java, Scala, etc. You can add log10() etc. 
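
A short PySpark sketch of that workaround (values are illustrative only):

{code:python}
# log() is the natural logarithm in Spark; use log10() or log(base, expr)
# when a base-10 (or other fixed-base) logarithm is intended.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

spark.range(1).select(
    F.log(F.lit(10.0)).alias("ln_10"),            # ~2.302585 (base e)
    F.log10(F.lit(10.0)).alias("log10_10"),       # 1.0
    F.log(10.0, F.lit(10.0)).alias("log_b10_10"), # 1.0 (explicit base)
).show()
{code}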

> The LOG function using 10 as the base, but Spark using E
> 
>
> Key: SPARK-28324
> URL: https://issues.apache.org/jira/browse/SPARK-28324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> select log(10);
> 2.302585092994046
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# select log(10);
>  log
> -
>1
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2019-07-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882286#comment-16882286
 ] 

Sean Owen commented on SPARK-4591:
--

What else would go under this umbrella?

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Affects Version/s: 1.6.3

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882275#comment-16882275
 ] 

Dongjoon Hyun edited comment on SPARK-28015 at 7/10/19 5:21 PM:


I added `1.6~2.3` as `Affected Versions`, too.
{code}
scala> sql("SELECT CAST('1999 08 01' AS DATE)").show
++
|CAST(1999 08 01 AS DATE)|
++
|  1999-01-01|
++
{code}


was (Author: dongjoon):
I added `2.0~2.3` as `Affected Versions`, too.
{code}
scala> sql("SELECT CAST('1999 08 01' AS DATE)").show
++
|CAST(1999 08 01 AS DATE)|
++
|  1999-01-01|
++
{code}

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24462.
---
Resolution: Duplicate

> Text socket micro-batch reader throws error when a query is restarted with 
> saved state
> --
>
> Key: SPARK-24462
> URL: https://issues.apache.org/jira/browse/SPARK-24462
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Arun Mahadevan
>Priority: Critical
>
> Exception thrown:
>  
> {noformat}
> scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = 
> 0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = 
> f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error
> java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377)
>  
> {noformat}
>  
> Sample code that reproduces the error on restarting the query.
>  
> {code:java}
>  
> import java.sql.Timestamp
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import spark.implicits._
> import org.apache.spark.sql.streaming.Trigger
> val lines = spark.readStream.format("socket").option("host", 
> "localhost").option("port", ).option("includeTimestamp", true).load()
> val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" 
> ").map(word => (word, line._2))).toDF("word", "timestamp")
> val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 
> minutes"), $"word").count().orderBy("window")
> val query = 
> windowedCounts.writeStream.outputMode("complete").option("checkpointLocation",
>  "/tmp/debug").format("console").option("truncate", "false").start()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882280#comment-16882280
 ] 

Dongjoon Hyun commented on SPARK-28015:
---

Hi, [~yumwang]. I updated the JIRA title according to the PR.
- We will throw exceptions for `date` prefix
- We will return NULL for CASTING.
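
A small sketch of the two forms being distinguished (illustrative; the actual results depend on whether the fix above has landed in the Spark version at hand):

{code:python}
# The date literal is expected to raise an error for an invalid format, while
# the CAST is expected to return NULL. Both are the intended behaviors
# described above, not necessarily what older releases do.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT CAST('1999 08 01' AS DATE)").show()  # expected: NULL
spark.sql("SELECT date '1999 08 01'")                  # expected: parse/analysis error
{code}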

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Summary: Check stringToDate() consumes entire input for the yyyy and 
yyyy-[m]m formats  (was: Invalid date formats should throw an exception)

> Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
> -
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28015) Invalid date formats should throw an exception

2019-07-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882275#comment-16882275
 ] 

Dongjoon Hyun edited comment on SPARK-28015 at 7/10/19 5:08 PM:


I added `2.0~2.3` as `Affected Versions`, too.
{code}
scala> sql("SELECT CAST('1999 08 01' AS DATE)").show
++
|CAST(1999 08 01 AS DATE)|
++
|  1999-01-01|
++
{code}


was (Author: dongjoon):
I added `2.0~2.3`, too.
{code}
scala> sql("SELECT CAST('1999 08 01' AS DATE)").show
++
|CAST(1999 08 01 AS DATE)|
++
|  1999-01-01|
++
{code}

> Invalid date formats should throw an exception
> --
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Invalid date formats should throw an exception

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Affects Version/s: 2.0.2
   2.1.3
   2.2.3

> Invalid date formats should throw an exception
> --
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28015) Invalid date formats should throw an exception

2019-07-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882275#comment-16882275
 ] 

Dongjoon Hyun commented on SPARK-28015:
---

I added `2.0~2.3`, too.
{code}
scala> sql("SELECT CAST('1999 08 01' AS DATE)").show
++
|CAST(1999 08 01 AS DATE)|
++
|  1999-01-01|
++
{code}

> Invalid date formats should throw an exception
> --
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28015) Invalid date formats should throw an exception

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28015:
--
Affects Version/s: 2.3.3

> Invalid date formats should throw an exception
> --
>
> Key: SPARK-28015
> URL: https://issues.apache.org/jira/browse/SPARK-28015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> Invalid date formats should throw an exception:
> {code:sql}
> SELECT date '1999 08 01'
> 1999-01-01
> {code}
> Supported date formats:
> https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L365-L374



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28335) Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28335.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 25100
[https://github.com/apache/spark/pull/25100]

> Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset 
> recovery from kafka
> -
>
> Key: SPARK-28335
> URL: https://issues.apache.org/jira/browse/SPARK-28335
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Tests
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
> Attachments: bad.log
>
>
> {code:java}
> org.scalatest.exceptions.TestFailedException: {} was empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply$mcV$sp(DirectKafkaStreamSuite.scala:466)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at or
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28335) Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28335:
-

Assignee: Gabor Somogyi

> Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset 
> recovery from kafka
> -
>
> Key: SPARK-28335
> URL: https://issues.apache.org/jira/browse/SPARK-28335
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Tests
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Attachments: bad.log
>
>
> {code:java}
> org.scalatest.exceptions.TestFailedException: {} was empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply$mcV$sp(DirectKafkaStreamSuite.scala:466)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at or
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28290.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25067
[https://github.com/apache/spark/pull/25067]

> Use `SslContextFactory.Server` instead of `SslContextFactory`
> -
>
> Key: SPARK-28290
> URL: https://issues.apache.org/jira/browse/SPARK-28290
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> `SslContextFactory` is deprecated at Jetty 9.4. This issue replaces it with 
> `SslContextFactory.Server`.
> - 
> https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
> - 
> https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28290:
-

Assignee: Dongjoon Hyun

> Use `SslContextFactory.Server` instead of `SslContextFactory`
> -
>
> Key: SPARK-28290
> URL: https://issues.apache.org/jira/browse/SPARK-28290
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> `SslContextFactory` is deprecated at Jetty 9.4. This issue replaces it with 
> `SslContextFactory.Server`.
> - 
> https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
> - 
> https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-10 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882200#comment-16882200
 ] 

Ruslan Dautkhanov commented on SPARK-28266:
---

Suspecting that the change in SPARK-22158 causes this.

> data correctness issue: data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run in Hive, for example - )
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28280) Convert and port 'group-by.sql' into UDF test base

2019-07-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881970#comment-16881970
 ] 

Stavros Kontopoulos edited comment on SPARK-28280 at 7/10/19 3:31 PM:
--

[~hyukjin.kwon]  the udf returns None where there is NULL (it returns a string in 
general) and that affects calculations like count(). I guess this is legitimate 
for this umbrella jira. Still, it breaks the tests.


was (Author: skonto):
[~hyukjin.kwon]  the udf returns None where there is NULL (it returns a string in 
general) and that affects calculations like count(). I guess this is legitimate 
for this umbrella jira. 

> Convert and port 'group-by.sql' into UDF test base
> --
>
> Key: SPARK-28280
> URL: https://issues.apache.org/jira/browse/SPARK-28280
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27560) HashPartitioner uses Object.hashCode which is not seeded

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27560.
---
Resolution: Not A Problem

> HashPartitioner uses Object.hashCode which is not seeded
> 
>
> Key: SPARK-27560
> URL: https://issues.apache.org/jira/browse/SPARK-27560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
> Environment: Notebook is running spark v2.4.0 local[*]
> Python 3.6.6 (default, Sep  6 2018, 13:10:03)
> [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
> I imagine this would reproduce on all operating systems and most versions of 
> spark though.
>Reporter: Andrew McHarg
>Priority: Minor
>
> Forgive the quality of the bug report here; I am a pyspark user and not super 
> familiar with the internals of spark, yet it seems I have a strange corner 
> case with the HashPartitioner.
> This may already be known, but repartition with HashPartitioner seems to 
> assign everything to the same partition if data that was partitioned by the same 
> column is only partially read (say one partition). I suppose it is an obvious 
> consequence of Object.hashCode being deterministic, but it took a while to 
> track down. 
> Steps to repro:
>  # Get dataframe with a bunch of uuids say 1
>  # repartition(100, 'uuid_column')
>  # save to parquet
>  # read from parquet
>  # collect()[:100] then filter using pyspark.sql.functions isin (yes I know 
> this is bad and sampleBy should probably be used here)
>  # repartition(10, 'uuid_column')
>  # Resulting dataframe will have all of its data in one single partition
> Jupyter notebook for the above: 
> https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e
> I think an easy fix would be to seed the HashPartitioner like many hashtable 
> libraries do to avoid denial of service attacks. It also might be the case 
> this is obvious behavior for more experienced spark users :)
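
The reported behavior also follows from simple modular arithmetic whenever the new partition count divides the old one. A self-contained sketch under those assumptions (local session, synthetic uuids, parquet round trip skipped):

{code:python}
# Rows taken from a single partition of a 100-partition hash layout all share
# hash(uuid) % 100, so after repartition(10, ...) they necessarily share
# hash(uuid) % 10 as well and land in a single partition. This mirrors the
# repro above without the parquet round trip.
import uuid
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([(str(uuid.uuid4()),) for _ in range(10000)],
                           ["uuid_column"])
df100 = df.repartition(100, "uuid_column")

# keep only the rows that landed in one of the 100 hash partitions
one_part = (df100.withColumn("pid", F.spark_partition_id())
                 .where("pid = 0").drop("pid"))

redistributed = one_part.repartition(10, "uuid_column")
(redistributed.withColumn("pid", F.spark_partition_id())
              .groupBy("pid").count().show())   # expect a single non-empty partition
{code}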



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26440) Show total CPU time across all tasks on stage pages

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26440.
---
Resolution: Won't Fix

> Show total CPU time across all tasks on stage pages
> ---
>
> Key: SPARK-26440
> URL: https://issues.apache.org/jira/browse/SPARK-26440
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> Task CPU time is added since 
> [SPARK-12221|https://issues.apache.org/jira/browse/SPARK-12221]. However, 
> total CPU time across all tasks is not displayed on stage pages. This could 
> be used to check whether a stage is CPU intensive or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26497) Show users where the pre-packaged SparkR and PySpark Dockerfiles are in the image build script.

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26497.
---
Resolution: Later

> Show users where the pre-packaged SparkR and PySpark Dockerfiles are in the 
> image build script.
> ---
>
> Key: SPARK-26497
> URL: https://issues.apache.org/jira/browse/SPARK-26497
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26097) Show partitioning details in DAG UI

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26097.
---
Resolution: Later

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Priority: Minor
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows the existence of the problem and what *stage* it is part of, it's hard 
> to infer back to the original SQL query that was given (e.g. what is the 
> specific join operation that is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent; this is actually a trivial change in code 
> (less than one line) that we believe can greatly benefit the investigation of 
> performance issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26097) Show partitioning details in DAG UI

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26097:
--
Priority: Minor  (was: Major)

This can be reopened with a PR that would address the different approach 
described in the last PR.

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Priority: Minor
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows the existence of the problem and what *stage* it is part of, it's hard 
> to infer back to the original SQL query that was given (e.g. what is the 
> specific join operation that is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent; this is actually a trivial change in code 
> (less than one line) that we believe can greatly benefit the investigation of 
> performance issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28310:
-

Assignee: Zhu, Lipeng

> ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | 
> IGNORE NULLS])
> 
>
> Key: SPARK-28310
> URL: https://issues.apache.org/jira/browse/SPARK-28310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Minor
>
> According to the ANSI SQL 2011:
> {code:sql}
> <first or last value function> ::=
>   <first or last value> <left paren> <value expression> <right paren> [ <null treatment> ]
> <null treatment> ::= RESPECT NULLS | IGNORE NULLS
> <first or last value> ::=
>   FIRST_VALUE | LAST_VALUE
> {code}
> Teradata - 
> [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA]
>  
> Oracle - 
> [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC]
> Redshift – 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html]
>  
> Postgresql didn't implement the Ignore/respect nulls. 
> [https://www.postgresql.org/docs/devel/functions-window.html]
> h3. Note
> The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for 
> {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This 
> is not implemented in PostgreSQL: the behavior is always the same as the 
> standard's default, namely {{RESPECT NULLS}}.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28310.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25082
[https://github.com/apache/spark/pull/25082]

> ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | 
> IGNORE NULLS])
> 
>
> Key: SPARK-28310
> URL: https://issues.apache.org/jira/browse/SPARK-28310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> According to the ANSI SQL 2011:
> {code:sql}
> <first or last value function> ::=
>   <first or last value> <left paren> <value expression> <right paren> [ <null treatment> ]
> <null treatment> ::= RESPECT NULLS | IGNORE NULLS
> <first or last value> ::=
>   FIRST_VALUE | LAST_VALUE
> {code}
> Teradata - 
> [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA]
>  
> Oracle - 
> [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC]
> Redshift – 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html]
>  
> Postgresql didn't implement the Ignore/respect nulls. 
> [https://www.postgresql.org/docs/devel/functions-window.html]
> h3. Note
> The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for 
> {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This 
> is not implemented in PostgreSQL: the behavior is always the same as the 
> standard's default, namely {{RESPECT NULLS}}.
>   
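
A runnable sketch of the semantics (respect vs. ignore NULLS), using the boolean-argument form Spark already accepts; the ANSI {{IGNORE NULLS}} spelling added by this ticket is an equivalent way to express the same thing. Sample data and window spec are illustrative only.

{code:python}
# first_value(v) keeps the default RESPECT NULLS behavior, while
# first_value(v, true) skips NULLs; this ticket adds the ANSI syntax for the
# latter.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT id, v,
           first_value(v)       OVER (ORDER BY id) AS respect_nulls,
           first_value(v, true) OVER (ORDER BY id) AS ignore_nulls
    FROM VALUES (1, CAST(NULL AS STRING)), (2, 'a'), (3, 'b') AS t(id, v)
""").show()
{code}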



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28327) Spark SQL can't support union with left query have queryOrganization

2019-07-10 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882109#comment-16882109
 ] 

Yuming Wang commented on SPARK-28327:
-

PostgreSQL also does not support this:
{code:sql}
postgres=# create or replace temporary view t1 as select * from (values(1), 
(2), (null), (3), (null)) as v (val);
CREATE VIEW
postgres=# SELECT val FROM t1 LIMIT 1 union all SELECT val FROM t1 LIMIT 2;
ERROR:  syntax error at or near "union"
LINE 1: SELECT val FROM t1 LIMIT 1 union all SELECT val FROM t1 LIMI...
   ^
postgres=# (SELECT val FROM t1 LIMIT 1) union all (SELECT val FROM t1 LIMIT 2);
 val
-
   1
   1
   2
(3 rows)
{code}


Could you add parentheses?
{code:sql}
spark-sql> create or replace temporary view t1 as select * from (values(1), 
(2), (null), (3), (null)) as v (val);
spark-sql> (SELECT val FROM t1 LIMIT 1) union all (SELECT val FROM t1 LIMIT 2);
1
1
2
{code}


> Spark SQL can't support union with left query  have queryOrganization
> -
>
> Key: SPARK-28327
> URL: https://issues.apache.org/jira/browse/SPARK-28327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL can't support SQL like 
> {code:java}
> SELECT A FROM TABLE_1 LIMIT 1
> UNION 
> SELECT A FROM TABLE_2 LIMIT 2{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28335) Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28335:


Assignee: (was: Apache Spark)

> Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset 
> recovery from kafka
> -
>
> Key: SPARK-28335
> URL: https://issues.apache.org/jira/browse/SPARK-28335
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Tests
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Priority: Minor
> Attachments: bad.log
>
>
> {code:java}
> org.scalatest.exceptions.TestFailedException: {} was empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply$mcV$sp(DirectKafkaStreamSuite.scala:466)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at or
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28335) Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka

2019-07-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28335:


Assignee: Apache Spark

> Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset 
> recovery from kafka
> -
>
> Key: SPARK-28335
> URL: https://issues.apache.org/jira/browse/SPARK-28335
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Tests
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Minor
> Attachments: bad.log
>
>
> {code:java}
> org.scalatest.exceptions.TestFailedException: {} was empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply$mcV$sp(DirectKafkaStreamSuite.scala:466)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at or
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28294) Support `spark.history.fs.cleaner.maxNum` configuration

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28294.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25072
[https://github.com/apache/spark/pull/25072]

> Support `spark.history.fs.cleaner.maxNum` configuration
> ---
>
> Key: SPARK-28294
> URL: https://issues.apache.org/jira/browse/SPARK-28294
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Up to now, Apache Spark maintains the event log directory by time policy, 
> `spark.history.fs.cleaner.maxAge`. However, there are two issues.
> 1. Some file systems have a limitation on the maximum number of files in a 
> single directory. For example, HDFS 
> `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default.
> - 
> https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
> 2. Spark is sometimes unable to clean up some old log files due to 
> permission issues. 
> To handle both (1) and (2), this issue aims to support an additional number 
> policy configuration for the event log directory, 
> `spark.history.fs.cleaner.maxNum`. Spark can try to keep the number of files 
> in the event log directory according to this policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28294) Support `spark.history.fs.cleaner.maxNum` configuration

2019-07-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28294:
-

Assignee: Dongjoon Hyun

> Support `spark.history.fs.cleaner.maxNum` configuration
> ---
>
> Key: SPARK-28294
> URL: https://issues.apache.org/jira/browse/SPARK-28294
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Up to now, Apache Spark maintains the event log directory by time policy, 
> `spark.history.fs.cleaner.maxAge`. However, there are two issues.
> 1. Some file systems have a limitation on the maximum number of files in a 
> single directory. For example, HDFS 
> `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default.
> - 
> https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
> 2. Spark is sometimes unable to clean up some old log files due to 
> permission issues. 
> To handle both (1) and (2), this issue aims to support an additional number 
> policy configuration for the event log directory, 
> `spark.history.fs.cleaner.maxNum`. Spark can try to keep the number of files 
> in the event log directory according to this policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28199:
--
Labels: release-notes  (was: )

> Remove usage of ProcessingTime in Spark codebase
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>  Labels: release-notes
>
> Even though ProcessingTime has been deprecated since 2.2.0, it's still used in 
> the Spark codebase, and the alternative Spark proposes actually uses deprecated 
> methods itself, which feels circular - we would never be able to remove the usage.
> This issue targets removing the usage of ProcessingTime in the Spark codebase 
> by adding a new class to replace ProcessingTime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28234) Spark Resources - add python support to get resources

2019-07-10 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882092#comment-16882092
 ] 

Thomas Graves commented on SPARK-28234:
---

Testing driver side:
{code:java}
>>> sc.resources['gpu'].addresses
['0', '1']
>>> sc.resources['gpu'].name
'gpu'{code}
 

basic code for testing executor side:

 
{code:java}
from pyspark import TaskContext
import socket

def task_info(*_):
    ctx = TaskContext()
    return ["addrs: {0}".format(ctx.resources()['gpu'].addresses)]

for x in sc.parallelize([], 8).mapPartitions(task_info).collect():
    print(x)

{code}

> Spark Resources - add python support to get resources
> -
>
> Key: SPARK-28234
> URL: https://issues.apache.org/jira/browse/SPARK-28234
> Project: Spark
>  Issue Type: Story
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> Add the equivalent python api for sc.resources and TaskContext.resources



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28338) spark.read.format("csv") treat empty string as null if csv file don't quotes in data

2019-07-10 Thread Jayadevan M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayadevan M updated SPARK-28338:

Summary: spark.read.format("csv") treat empty string as null if csv file 
don't quotes in data  (was: spark.read.format("csv") treat empty string as null 
if csv file don't quotes in columns)

> spark.read.format("csv") treat empty string as null if csv file don't quotes 
> in data
> 
>
> Key: SPARK-28338
> URL: https://issues.apache.org/jira/browse/SPARK-28338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jayadevan M
>Priority: Major
>
> The csv input file
> cat sample.csv 
> Name,Lastname,Age
> abc,,32
> pqr,xxx,30
>  
> spark-shell
> spark.read.format("csv").option("header", 
> "true").load("/media/ub_share/projects/*.csv").head(3)
> res14: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])
>  
> scala> spark.read.format("csv").option("header", "true").option("nullValue", 
> "?").load("/media/ub_share/projects/*.csv").head(3)
> res15: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28338) spark.read.format("csv") treat empty string as null if csv file don't quotes in columns

2019-07-10 Thread Jayadevan M (JIRA)
Jayadevan M created SPARK-28338:
---

 Summary: spark.read.format("csv") treat empty string as null if 
csv file don't quotes in columns
 Key: SPARK-28338
 URL: https://issues.apache.org/jira/browse/SPARK-28338
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Jayadevan M


The csv input file

cat sample.csv 
Name,Lastname,Age
abc,,32
pqr,xxx,30

 

spark-shell

spark.read.format("csv").option("header", 
"true").load("/media/ub_share/projects/*.csv").head(3)
res14: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])

 

scala> spark.read.format("csv").option("header", "true").option("nullValue", 
"?").load("/media/ub_share/projects/*.csv").head(3)
res15: Array[org.apache.spark.sql.Row] = Array([abc,null,32], [pqr,xxx,30])
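
A self-contained sketch of the report (assumes a local session and a temporary file generated on the fly):

{code:python}
# Reproduces the reported behavior: an unquoted empty CSV field is read back
# as null rather than as an empty string.
import os, tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("Name,Lastname,Age\nabc,,32\npqr,xxx,30\n")

df = spark.read.format("csv").option("header", "true").load(path)
df.show()   # Lastname of the first row shows as null, not ""
{code}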



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28267) Update building-spark.md

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28267:
-

Assignee: Yuming Wang

> Update building-spark.md
> 
>
> Key: SPARK-28267
> URL: https://issues.apache.org/jira/browse/SPARK-28267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28267) Update building-spark.md

2019-07-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28267.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25063
[https://github.com/apache/spark/pull/25063]

> Update building-spark.md
> 
>
> Key: SPARK-28267
> URL: https://issues.apache.org/jira/browse/SPARK-28267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


