[jira] [Created] (SPARK-20734) Structured Streaming spark.sql.streaming.schemaInference not handling schema changes

2017-05-13 Thread Ram (JIRA)
Ram created SPARK-20734:
---

 Summary: Structured Streaming spark.sql.streaming.schemaInference 
not handling schema changes
 Key: SPARK-20734
 URL: https://issues.apache.org/jira/browse/SPARK-20734
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.1
Reporter: Ram


sparkSession.config("spark.sql.streaming.schemaInference", 
true).getOrCreate();
Dataset dataset = 
sparkSession.readStream().parquet("file:/files-to-process");

StreamingQuery streamingQuery =
dataset.writeStream().option("checkpointLocation", 
"file:/checkpoint-location")
.outputMode(Append()).start("file:/save-parquet-files");

streamingQuery.awaitTermination();

After the streaming query has started, if there is a schema change in new parquet 
files under the files-to-process directory, Structured Streaming does not write 
data with the new schema. Is it possible to handle these schema changes in 
Structured Streaming?
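
In the meantime, one possible workaround (a minimal sketch, not a fix inside 
Structured Streaming itself) is to stop the query and restart it with an 
explicitly supplied schema that already covers the new columns. The column 
names below are hypothetical, and whether the existing checkpoint and sink 
accept the widened schema depends on the sink format; shown in Scala for 
brevity.

{code}
// Minimal sketch: restart the stream with an explicit schema instead of
// relying on inference. Column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .config("spark.sql.streaming.schemaInference", "true")
  .getOrCreate()

val evolvedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType),
  StructField("newColumn", StringType)  // column introduced by the schema change
))

val dataset = spark.readStream
  .schema(evolvedSchema)                // explicit schema, no inference needed
  .parquet("file:/files-to-process")

val query = dataset.writeStream
  .option("checkpointLocation", "file:/checkpoint-location")
  .outputMode("append")
  .start("file:/save-parquet-files")

query.awaitTermination()
{code}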



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20733) Permission Error: Access Denied

2017-05-13 Thread Parker Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parker Xiao  updated SPARK-20733:
-
Description: 
I am experiencing the following issue when I try to launch pyspark. 

{code}
c:\spark>pyspark
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) 
[MSC 
v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "C:\spark\bin\..\python\pyspark\shell.py", line 38, in 
   SparkContext._ensure_initialized()
File "C:\spark\python\pyspark\context.py", line 259, in _ensure_initialized
   SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\python\pyspark\java_gateway.py", line 80, in launch_gateway
   proc = Popen(command, stdin=PIPE, env=env)
File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 707, in __init__
   restore_signals, start_new_session)
File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 990, in 
_execute_child
startupinfo)
PermissionError: [WinError 5] Access is denied
{code}

I know that some of these problems occur because of administrator or permission 
problems. However, even when I go to the folder and right-click 'Run as 
administrator', the problem still exists. Could anyone help me figure out what 
the problem is? 

  was:
I am experiencing the following issue when I tried to launch pyspark. 

c:\spark>pyspark
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) 
[MSC 
v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "C:\spark\bin\..\python\pyspark\shell.py", line 38, in 
   SparkContext._ensure_initialized()
File "C:\spark\python\pyspark\context.py", line 259, in _ensure_initialized
   SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\python\pyspark\java_gateway.py", line 80, in launch_gateway
   proc = Popen(command, stdin=PIPE, env=env)
File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 707, in __init__
   restore_signals, start_new_session)
File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 990, in 
_execute_child
startupinfo)
PermissionError: [WinError 5] Access is denied

I know that some of these problems occur because of administrator problem. 
However, when I go to the folder and right click 'run as administrator', the 
problem still exists. So could anyone help me to figure out what the problem 
is? 


> Permission Error: Access Denied
> ---
>
> Key: SPARK-20733
> URL: https://issues.apache.org/jira/browse/SPARK-20733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Windows 64-bit, Scala Version 2.11.8, Java 1.8.0_131, 
> Python 3.6, Anaconda 4.3.1
>Reporter: Parker Xiao 
>Priority: Critical
>
> I am experiencing the following issue when I try to launch pyspark. 
> {code}
> c:\spark>pyspark
> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) 
> [MSC 
> v.1900 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
> File "C:\spark\bin\..\python\pyspark\shell.py", line 38, in 
>SparkContext._ensure_initialized()
> File "C:\spark\python\pyspark\context.py", line 259, in 
> _ensure_initialized
>SparkContext._gateway = gateway or launch_gateway(conf)
> File "C:\spark\python\pyspark\java_gateway.py", line 80, in launch_gateway
>proc = Popen(command, stdin=PIPE, env=env)
> File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 707, in __init__
>restore_signals, start_new_session)
> File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 990, in 
> _execute_child
> startupinfo)
> PermissionError: [WinError 5] Access is denied
> {code}
> I know that some of these problems occur because of administrator or 
> permission problems. However, even when I go to the folder and right-click 
> 'Run as administrator', the problem still exists. Could anyone help me figure 
> out what the problem is? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20733) Permission Error: Access Denied

2017-05-13 Thread Parker Xiao (JIRA)
Parker Xiao  created SPARK-20733:


 Summary: Permission Error: Access Denied
 Key: SPARK-20733
 URL: https://issues.apache.org/jira/browse/SPARK-20733
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
 Environment: Windows 64-bit, Scala Version 2.11.8, Java 1.8.0_131, 
Python 3.6, Anaconda 4.3.1
Reporter: Parker Xiao 
Priority: Critical


I am experiencing the following issue when I try to launch pyspark. 

c:\spark>pyspark
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) 
[MSC 
v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "C:\spark\bin\..\python\pyspark\shell.py", line 38, in 
   SparkContext._ensure_initialized()
File "C:\spark\python\pyspark\context.py", line 259, in _ensure_initialized
   SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\python\pyspark\java_gateway.py", line 80, in launch_gateway
   proc = Popen(command, stdin=PIPE, env=env)
File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 707, in __init__
   restore_signals, start_new_session)
File "C:\Users\shuzhe\Anaconda3\lib\subprocess.py", line 990, in 
_execute_child
startupinfo)
PermissionError: [WinError 5] Access is denied

I know that some of these problems occur because of administrator or permission 
problems. However, even when I go to the folder and right-click 'Run as 
administrator', the problem still exists. Could anyone help me figure out what 
the problem is? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009564#comment-16009564
 ] 

Apache Spark commented on SPARK-20725:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17975

> partial aggregate should behave correctly for sameResult
> 
>
> Key: SPARK-20725
> URL: https://issues.apache.org/jira/browse/SPARK-20725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-20725:
--
Target Version/s: 2.1.2

> partial aggregate should behave correctly for sameResult
> 
>
> Key: SPARK-20725
> URL: https://issues.apache.org/jira/browse/SPARK-20725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20725.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> partial aggregate should behave correctly for sameResult
> 
>
> Key: SPARK-20725
> URL: https://issues.apache.org/jira/browse/SPARK-20725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19840) Disallow creating permanent functions with invalid class names

2017-05-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009348#comment-16009348
 ] 

Dongjoon Hyun commented on SPARK-19840:
---

Hi, [~smilegator].
Is there any plan to fix this in 2.2.0?
Or, do you think it is possible to fix this with the first PR in 2.2.0 for now?

> Disallow creating permanent functions with invalid class names
> --
>
> Key: SPARK-19840
> URL: https://issues.apache.org/jira/browse/SPARK-19840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>
> Currently, Spark raises exceptions on creating invalid **temporary** 
> functions, but doesn't for **permanent** functions. This issue aims to 
> disallow creating permanent functions with invalid class names.
> *BEFORE*
> {code}
> scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid at 
> ...
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> ++
> ||
> ++
> ++
> scala> sql("show functions like 'function_*'").show(false)
> +---+
> |function   |
> +---+
> |default.function_with_invalid_classname|
> +---+
> scala> sql("select function_with_invalid_classname()").show
> org.apache.spark.sql.AnalysisException: Undefined function: 
> 'function_with_invalid_classname'. This function is neither a registered 
> temporary function nor a permanent function registered in the database 
> 'default'.; line 1 pos 7
> {code}
> *AFTER*
> {code}
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009344#comment-16009344
 ] 

Dongjoon Hyun commented on SPARK-20684:
---

Hi, [~falaki].
According to the comments on the PR, this issue should be closed as a DUPLICATE 
of SPARK-17865.
If you agree, could you close this issue?

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.
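
For context, the Scala side of the API that this issue asks SparkR to mirror 
already exists; a minimal sketch (the view name is illustrative):

{code}
// Scala sketch of the global temp view API requested for SparkR.
val df = spark.range(10)

// Register the DataFrame in the global_temp database so that other sessions
// of the same application can see it.
df.createGlobalTempView("shared_numbers")

// Another session in the same application can query it via the global_temp prefix.
spark.newSession().sql("SELECT count(*) FROM global_temp.shared_numbers").show()

// Drop it when no longer needed.
spark.catalog.dropGlobalTempView("shared_numbers")
{code}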



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18772) Unnecessary conversion try for special floats in JSON

2017-05-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18772:
---

Assignee: Hyukjin Kwon

> Unnecessary conversion try for special floats in JSON
> -
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Nathan Howell
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> It looks like we can avoid some unnecessary conversion attempts for special 
> floats in JSON.
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> spark.read.schema(StructType(Seq(StructField("a", 
> DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
> "nan"}""").toDS).show()
> 17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NumberFormatException: For input string: "nan"
> ...
> {code}
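
For readers puzzled by the NumberFormatException above: Java's double parsing 
is case-sensitive about the special-float spellings, so the stack trace 
suggests the reader attempts a conversion of "nan" that is bound to fail. A 
small plain-Scala illustration, independent of Spark:

{code}
// java.lang.Double (which backs Scala's String.toDouble) is case-sensitive
// about special floats, which is why the string "nan" cannot be converted.
"NaN".toDouble        // Double.NaN
"Infinity".toDouble   // Double.PositiveInfinity
"nan".toDouble        // throws java.lang.NumberFormatException: For input string: "nan"
{code}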



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18772) Unnecessary conversion try for special floats in JSON

2017-05-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18772.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17956
[https://github.com/apache/spark/pull/17956]

> Unnecessary conversion try for special floats in JSON
> -
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Nathan Howell
>Priority: Minor
> Fix For: 2.2.0
>
>
> It looks like we can avoid some unnecessary conversion attempts for special 
> floats in JSON.
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> spark.read.schema(StructType(Seq(StructField("a", 
> DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
> "nan"}""").toDS).show()
> 17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NumberFormatException: For input string: "nan"
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20628) Keep track of nodes which are going to be shut down & avoid scheduling new tasks

2017-05-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009261#comment-16009261
 ] 

holdenk commented on SPARK-20628:
-

I'm going to take a crack at implementing this - I'm traveling a lot until 
Spark Summit so might be on the slow side working on this.

> Keep track of nodes which are going to be shut down & avoid scheduling new 
> tasks
> 
>
> Key: SPARK-20628
> URL: https://issues.apache.org/jira/browse/SPARK-20628
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>
> Keep track of nodes which are going to be shut down. We considered adding 
> this for YARN but took a different approach; for cases where we can't control 
> instance termination (EC2, GCE, etc.), this may make more sense.
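
To make the proposal concrete, a purely illustrative sketch of the pattern 
being described; this is not Spark's actual scheduler code, and the names are 
hypothetical:

{code}
import scala.collection.mutable

// Sketch only: remember hosts that are about to be decommissioned and skip
// them when handing out new tasks.
class DecommissionTracker {
  private val decommissioningHosts = mutable.Set[String]()

  // Called when the cluster manager (EC2/GCE, etc.) signals an upcoming shutdown.
  def markDecommissioning(host: String): Unit = synchronized {
    decommissioningHosts += host
  }

  // The scheduler would consult this before placing a new task on a host.
  def canSchedule(host: String): Boolean = synchronized {
    !decommissioningHosts.contains(host)
  }
}
{code}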



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down

2017-05-13 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-20629:

Summary: Copy shuffle data when nodes are being shut down  (was: Copy data 
when nodes are being shut down)

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>
> We decided not to do this for YARN, but on EC2/GCE and similar systems, nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18772) Unnecessary conversion try for special floats in JSON

2017-05-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18772:
-
Affects Version/s: 2.2.0
  Description: 
It looks like we can avoid some unnecessary conversion attempts for special 
floats in JSON.

{code}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", 
DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
"nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
{code}




  was:
It looks we can avoid some cases for unnecessary conversion try in special 
floats in JSON.

Also, we could support some other cases for them such as {{+INF}}, {{INF}} and 
{{-INF}}.

For avoiding additional conversions, please refer the codes below:

{code}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", 
DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
"nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
{code}




  Summary: Unnecessary conversion try for special floats in JSON  
(was: Unnecessary conversion try and some missing cases for special floats in 
JSON)

> Unnecessary conversion try for special floats in JSON
> -
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Nathan Howell
>Priority: Minor
>
> It looks like we can avoid some unnecessary conversion attempts for special 
> floats in JSON.
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> spark.read.schema(StructType(Seq(StructField("a", 
> DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
> "nan"}""").toDS).show()
> 17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NumberFormatException: For input string: "nan"
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20731) Add ability to change or omit .csv file extension in CSV Data Source

2017-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20731:


Assignee: Apache Spark

> Add ability to change or omit .csv file extension in CSV Data Source
> 
>
> Key: SPARK-20731
> URL: https://issues.apache.org/jira/browse/SPARK-20731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Mikko Kupsu
>Assignee: Apache Spark
>Priority: Minor
>
> CSV Data Source has the ability to change the field delimiter. If this is 
> changed to TAB, for example, then the default file extension "csv" is 
> misleading and e.g. "tsv" would be preferable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20731) Add ability to change or omit .csv file extension in CSV Data Source

2017-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20731:


Assignee: (was: Apache Spark)

> Add ability to change or omit .csv file extension in CSV Data Source
> 
>
> Key: SPARK-20731
> URL: https://issues.apache.org/jira/browse/SPARK-20731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Mikko Kupsu
>Priority: Minor
>
> CSV Data Source has the ability to change the field delimiter. If this is 
> changed to TAB, for example, then the default file extension "csv" is 
> misleading and e.g. "tsv" would be preferable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20731) Add ability to change or omit .csv file extension in CSV Data Source

2017-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009180#comment-16009180
 ] 

Apache Spark commented on SPARK-20731:
--

User 'mikkokupsu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17973

> Add ability to change or omit .csv file extension in CSV Data Source
> 
>
> Key: SPARK-20731
> URL: https://issues.apache.org/jira/browse/SPARK-20731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Mikko Kupsu
>Priority: Minor
>
> CSV Data Source has the ability to change the field delimiter. If this is 
> changed to TAB, for example, then the default file extension "csv" is 
> misleading and e.g. "tsv" would be preferable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20731) Add ability to change or omit .csv file extension in CSV Data Source

2017-05-13 Thread Mikko Kupsu (JIRA)
Mikko Kupsu created SPARK-20731:
---

 Summary: Add ability to change or omit .csv file extension in CSV 
Data Source
 Key: SPARK-20731
 URL: https://issues.apache.org/jira/browse/SPARK-20731
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Mikko Kupsu
Priority: Minor


CSV Data Source has the ability to change the field delimiter. If this is 
changed to TAB, for example, then the default file extension "csv" is misleading 
and e.g. "tsv" would be preferable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20723:


Assignee: Apache Spark

> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Assignee: Apache Spark
>Priority: Minor
>
> Currently the Random Forest implementation caches the intermediate data using 
> the *MEMORY_AND_DISK* storage level. This creates issues in low-memory 
> scenarios. So we should expose an expert param *intermediateStorageLevel* 
> which allows the user to customise the storage level. This is similar to the 
> ALS options specified in the JIRA below:
> https://issues.apache.org/jira/browse/SPARK-14412
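
For reference, a minimal sketch of the existing ALS expert param (from 
SPARK-14412) that this request wants the tree ensembles to mirror; the column 
names are illustrative:

{code}
import org.apache.spark.ml.recommendation.ALS

// ALS already lets callers tune the storage level of its intermediate data;
// the request is for an analogous expert param on RandomForestClassifier.
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setIntermediateStorageLevel("DISK_ONLY")  // default is MEMORY_AND_DISK
{code}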



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20723:


Assignee: (was: Apache Spark)

> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Priority: Minor
>
> Currently the Random Forest implementation caches the intermediate data using 
> the *MEMORY_AND_DISK* storage level. This creates issues in low-memory 
> scenarios. So we should expose an expert param *intermediateStorageLevel* 
> which allows the user to customise the storage level. This is similar to the 
> ALS options specified in the JIRA below:
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009171#comment-16009171
 ] 

Apache Spark commented on SPARK-20723:
--

User 'phatak-dev' has created a pull request for this issue:
https://github.com/apache/spark/pull/17972

> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Priority: Minor
>
> Currently the Random Forest implementation caches the intermediate data using 
> the *MEMORY_AND_DISK* storage level. This creates issues in low-memory 
> scenarios. So we should expose an expert param *intermediateStorageLevel* 
> which allows the user to customise the storage level. This is similar to the 
> ALS options specified in the JIRA below:
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-13 Thread madhukara phatak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

madhukara phatak updated SPARK-20723:
-
Description: 
Currently the Random Forest implementation caches the intermediate data using 
the *MEMORY_AND_DISK* storage level. This creates issues in low-memory 
scenarios. So we should expose an expert param *intermediateStorageLevel* which 
allows the user to customise the storage level. This is similar to the ALS 
options specified in the JIRA below:

https://issues.apache.org/jira/browse/SPARK-14412

  was:
Currently Random Forest implementation cache as the intermediatery data using 
*MEMORY_AND_DISK* storage level. This creates issues in low memory scenarios. 
So we should expose an expert param *intermediateRDDStorageLevel* which allows 
user to customise the storage level. This is similar to als options like 
specified in below jira

https://issues.apache.org/jira/browse/SPARK-14412


> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Priority: Minor
>
> Currently the Random Forest implementation caches the intermediate data using 
> the *MEMORY_AND_DISK* storage level. This creates issues in low-memory 
> scenarios. So we should expose an expert param *intermediateStorageLevel* 
> which allows the user to customise the storage level. This is similar to the 
> ALS options specified in the JIRA below:
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-13 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009165#comment-16009165
 ] 

Tejas Patil commented on SPARK-19122:
-

Thanks for confirming. I have added it to the JIRA description in case someone 
comes across this in the future.

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if join predicates are specified in query in *same* order as bucketing 
> and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with join predicates in *different* order from bucketing and 
> sort order leads to extra shuffle and sort being introduced
> {code}
> scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-13 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-19122:

Description: 
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if join predicates are specified in query in *same* order as bucketing and 
sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with join predicates in *different* order from bucketing and 
sort order leads to extra shuffle and sort being introduced

{code}
scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k#101, j#100, 200)
  +- *Project [i#99, j#100, k#101]
 +- *Filter (isnotnull(j#100) && isnotnull(k#101))
+- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}

  was:
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if join predicates are specified in query in *same* order as bucketing and 
sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with join predicates in *different* order from bucketing and 
sort order leads to extra shuffle and sort being introduced

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc