[jira] [Commented] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed

2018-03-31 Thread Fei Niu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421558#comment-16421558
 ] 

Fei Niu commented on SPARK-10781:
-

This can be a very useful feature. For example, if your sequence file format 
itself is bad, there is currently no way to catch the exception and move on, 
which makes some data sets impossible to process.
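
For reference, the MapReduce settings named in the issue description below are plain 
job-configuration properties. A minimal sketch of how they are set on a Hadoop 
MapReduce job (Spark currently has no equivalent knob, which is what this ticket asks for):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Sketch only: these are the MapReduce properties cited in the issue
// description; they allow a job to succeed even if up to 5% of its map
// or reduce tasks fail permanently.
val conf = new Configuration()
conf.set("mapreduce.map.failures.maxpercent", "5")
conf.set("mapreduce.reduce.failures.maxpercent", "5")
val job = Job.getInstance(conf, "failure-tolerant-job")
{code}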

> Allow certain number of failed tasks and allow job to succeed
> -
>
> Key: SPARK-10781
> URL: https://issues.apache.org/jira/browse/SPARK-10781
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Priority: Major
>
> MapReduce has the configs mapreduce.map.failures.maxpercent and 
> mapreduce.reduce.failures.maxpercent, which allow a certain percentage of 
> tasks to fail while the job still succeeds.
> This could be a useful feature in Spark as well, for jobs that don't need all 
> tasks to be successful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2018-03-31 Thread Joe Pallas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421548#comment-16421548
 ] 

Joe Pallas commented on SPARK-22393:


The changes that were imported in [https://github.com/apache/spark/pull/19846] 
don't seem to cover all the cases that the Scala 2.12 changes covered.  To be 
specific, this sequence:
{code}
import scala.reflect.runtime.{universe => ru}
import ru.TypeTag
class C[T: TypeTag](value: T)
{code}
works correctly in Scala 2.12 with -Yrepl-class-based, but does not work in 
spark-shell 2.3.0.

I don't understand the import-handling code well enough to pinpoint the problem, 
however. It figures out that it needs the import for TypeTag, but it doesn't 
recognize that this import depends on the previous one:
{noformat}
:9: error: not found: value ru
import ru.TypeTag
   ^
{noformat}
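
A fully qualified import avoids the dependent-import problem entirely (offered only 
as a hedged workaround sketch; not verified against spark-shell 2.3.0):

{code:scala}
// Workaround sketch: import TypeTag by its full path so the REPL wrapper
// does not have to resolve the `ru` alias defined on an earlier line.
import scala.reflect.runtime.universe.TypeTag
class C[T: TypeTag](value: T)
{code}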

 

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Assignee: Mark Petruska
>Priority: Minor
> Fix For: 2.3.0
>
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2018-03-31 Thread Kingsley Jones (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421530#comment-16421530
 ] 

Kingsley Jones commented on SPARK-12216:


Same issue here under Windows 10 and Windows Server 2016, using Java 1.8, Spark 
2.2.1, and Hadoop 2.7.

My tests support the contention of [~IgorBabalich]... it seems that classloaders 
instantiated by the code are never closed. On *nix this is not a problem since 
the files are not locked; on Windows, however, the files are locked.

In addition to the resources mentioned by Igor, this Oracle material on the Java 7 
class loader changes seems relevant:

[https://docs.oracle.com/javase/7/docs/technotes/guides/net/ClassLoader.html]

A new method, close(), was introduced to address the problem, which shows up on 
Windows because of the different treatment of file locks in the Windows file 
system compared with *nix file systems.

I would point out that this is a generic Java issue which undermines the 
cross-platform intent of the platform as a whole.

The Oracle blog also contains a post:

[https://blogs.oracle.com/corejavatechtips/closing-a-urlclassloader]

I have been searching the Apache Spark codebase for classloader instances, looking 
for any ".close()" calls. I could not find any, so I believe 
[~IgorBabalich] is correct - the issue has to do with classloaders not being 
closed.

I would fix it myself, but thus far it is not clear to me *when* the classloader 
needs to be closed. That is just ignorance on my part. The question is whether 
the classloader should be closed while it is still available as a variable at the 
point where it was instantiated, or later during the ShutdownHookManager cleanup. 
If the latter, it is not clear to me how to actually get a list of open 
classloaders.

That is where I am at so far. I am prepared to put some work into this, but I 
need some help from those who know the codebase to answer the above 
question - maybe with a well-isolated test.

MY TESTS...

This issue has been around in one form or another for at least four years and 
shows up on many threads.

The standard answer is that it is a "permissions issue" to do with Windows.

That assertion is objectively false.

There is a simple test to prove it.

At a Windows prompt, start spark-shell:

C:\spark\spark-shell

then get the temp file directory:

scala> sc.getConf.get("spark.repl.class.outputDir")

It will be under the %AppData%\Local\Temp tree, e.g.

C:\Users\kings\AppData\Local\Temp\spark-d67b262e-f6c8-43d7-8790-731308497f02\repl-4cc87dce-8608-4643-b869-b0287ac4571f

where the last directory name contains a GUID that changes on each run.

With the spark session still open, go to the Temp directory and try to delete 
the given directory.

You won't be able to... there is a lock on it.

Now issue

scala> :quit

to quit the session.

The stack trace will show that ShutdownHookManager tried to delete the 
directory above but could not.

If you now try to delete it through the file system, you can.

This is because the JVM cleans up the locks on exit.

So it is not a permissions issue, but a consequence of how Windows treats 
file locks.

This is the *known issue* that was addressed in Java by making URLClassLoader 
implement the Closeable interface and expose a close() method. It was 
fixed there because many enterprise systems run on Windows.
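
For reference, a minimal sketch of that Java 7+ API (the temp directory and the 
class loading here are illustrative only, not Spark's actual code path):

{code:scala}
import java.net.URLClassLoader
import java.nio.file.Files

// Minimal sketch of the Java 7+ fix described above: URLClassLoader
// implements Closeable, and close() releases the file handles that would
// otherwise keep Windows from deleting the directory.
val tempDir = Files.createTempDirectory("repl-classes")
val loader  = new URLClassLoader(Array(tempDir.toUri.toURL))
try {
  // ... classes would be loaded here via loader.loadClass(...) ...
} finally {
  loader.close()                  // releases the lock on tempDir
  Files.deleteIfExists(tempDir)   // now succeeds, even on Windows
}
{code}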

Now... to further test the cause, I used the Windows Subsystem for Linux.

To access this (post install) you run

C:> bash

from a command prompt.

In order to get this to work, I used the same Spark install, but had to install 
a fresh copy of the JDK on Ubuntu within the Windows bash subsystem. This is 
standard Ubuntu stuff; note that the path to your Windows C: drive is /mnt/c.

If I rerun the same test, the new output of

scala> sc.getConf.get("spark.repl.class.outputDir")

will be a different folder location under the Linux /tmp tree, but with the same 
setup otherwise.

It is now possible to delete the Spark folders in /tmp *while the session is 
still active*. This is the difference between Windows and Linux: even though 
bash is running Ubuntu on Windows, it has the Linux file-locking behaviour, 
which means you can delete the Spark temp folders while a session is running.

If you run through a new session with spark-shell at the Linux prompt and issue 
:quit, it will shut down without any stack trace error from ShutdownHookManager.

So, my conclusions are as follows:

1) this is not a permissions issue, despite the common assertion

2) it is a Windows-specific problem for *known* reasons - namely the difference 
in file locking compared with Linux

3) it was considered a *bug* in the Java ecosystem and was fixed as such from 
Java 7 with the close() method

Further...

People who need to run Spark on Windows infrastructure (like me) can either run 
a Docker container or use the Windows Subsystem for Linux to launch processes. So 
we do have a workaround.


[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2018-03-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19842:

Target Version/s: 2.4.0, 3.0.0

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]
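
As a purely hypothetical illustration of the class of rewrite such constraints enable 
(the table names and data below are made up, and no constraint syntax from the attached 
proposal is assumed): a declared primary key/foreign key relationship lets the optimizer 
drop a join that neither filters rows nor contributes columns.

{code:scala}
import spark.implicits._

// Made-up tables: dim.cust_id is conceptually a primary key and
// fact.cust_id a foreign key referencing it.
Seq((1, "gold"), (2, "silver")).toDF("cust_id", "tier").createOrReplaceTempView("dim")
Seq((1, 10.0), (2, 20.0), (1, 5.0)).toDF("cust_id", "amount").createOrReplaceTempView("fact")

// Today the join below must actually be executed:
spark.sql("SELECT f.amount FROM fact f JOIN dim d ON f.cust_id = d.cust_id").show()

// With informational PK/FK constraints declared, Catalyst could rewrite it to the
// equivalent join-free query, since every non-null fact.cust_id is guaranteed to
// match exactly one dim row:
spark.sql("SELECT amount FROM fact WHERE cust_id IS NOT NULL").show()
{code}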



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23791) Sub-optimal generated code for sum aggregating

2018-03-31 Thread Valentin Nikotin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421407#comment-16421407
 ] 

Valentin Nikotin commented on SPARK-23791:
--

Hi Marco Gaido, 

I tested performance with 2.2 and 2.3; there was not really any difference between 
the two versions.
1. I am running a test right now to get timings for 1 to 105 columns (GCP 
Dataproc 1+4-node cluster with Spark 2.2). I had noticed the following before and 
have confirmed it now:

* 1-13 cols: wholeStage runs faster
* starting from 14 cols the timing jumps ~4-5x, while with wholeStage disabled it 
grows linearly
* 84-99 cols: the error above
* 100+ cols: timings are the same again

2. I have not tried current master yet; I would like to test it next week.

I actually found a very similar issue, though without much detail: 
[SPARK-20479|https://issues.apache.org/jira/browse/SPARK-20479]
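
For reference, a sketch of how such a column-count sweep can be driven, assuming the 
test() helper from the snippet in the issue description below has been pasted into a 
spark-shell session (hypothetical setup, timings will vary):

{code:scala}
// Sweep the number of aggregated columns and print the wall-clock time for
// each run; test(cnt) comes from the issue's snippet and returns seconds.
val sweep = for (cols <- 1 to 105) yield (cols, test(cols))
sweep.foreach { case (cols, seconds) => println(f"$cols%3d cols: $seconds%.2f s") }
{code}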

> Sub-optimal generated code for sum aggregating
> --
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation over 50 columns runs ~4 times slower than without 
> wholeStage codegen.
> Please check the test case code below. Note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.
> {code:scala}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> 
> object SPARK_23791 {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession
>       .builder()
>       .master("local[4]")
>       .appName("test")
>       .getOrCreate()
> 
>     def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: DataFrame) =
>       (0 until cnt).foldLeft(inputDF)((df, idx) => df.withColumn(s"$prefix$idx", value))
> 
>     val dummy = udf(() => Option.empty[Int])
> 
>     def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>       val t0 = System.nanoTime()
>       spark.range(rows).toDF()
>         .withColumn("grp", col("id").mod(grps))
>         .transform(addConstColumns("null_", cnt, dummy()))
>         .groupBy("grp")
>         .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
>         .collect()
>       val t1 = System.nanoTime()
>       (t1 - t0) / 1e9
>     }
> 
>     val timings = for (i <- 1 to 3) yield {
>       spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
>       val with_wholestage = test()
>       spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
>       val without_wholestage = test()
>       (with_wholestage, without_wholestage)
>     }
> 
>     timings.foreach(println)
> 
>     println("Press enter ...")
>     System.in.read()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2018-03-31 Thread Shahid K I (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahid K I updated SPARK-23837:
---
Priority: Minor  (was: Major)

> Create table as select gives exception if the spark generated alias name 
> contains comma
> ---
>
> Key: SPARK-23837
> URL: https://issues.apache.org/jira/browse/SPARK-23837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shahid K I
>Priority: Minor
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws 
> an exception.
>  
> 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.171 seconds)
> 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
> +--------------------------------------------------------------+
> | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
> +--------------------------------------------------------------+
> +--------------------------------------------------------------+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
> Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
> column whose name contains commas in Hive metastore. Table: `default`.`b`; 
> Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)
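
A hedged workaround sketch for the transcript above: give the expression an explicit 
alias so the auto-generated column name (which contains commas from the DECIMAL(20,5) 
casts) never reaches the Hive metastore. The alias name below is arbitrary.

{code:scala}
// Sketch only: an explicit alias avoids the comma-bearing generated name.
spark.sql("CREATE TABLE b AS SELECT col1 * col2 AS col1_times_col2 FROM a")
{code}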



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2018-03-31 Thread Shahid K I (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahid K I updated SPARK-23837:
---
Description: 
When a Spark-generated alias name contains a comma, the Hive metastore throws an 
exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.171 seconds)

0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--------------------------------------------------------------+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--------------------------------------------------------------+
+--------------------------------------------------------------+
No rows selected (0.168 seconds)

0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive metastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)



  was:

For spark generated alias name contains comma, Hive metastore throws exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 
decimal(18,5));
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.171 seconds)
0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+---+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+---+
+---+
No rows selected (0.168 seconds)
0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;

Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive metastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); 
(state=,code=0)




> Create table as select gives exception if the spark generated alias name 
> contains comma
> ---
>
> Key: SPARK-23837
> URL: https://issues.apache.org/jira/browse/SPARK-23837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shahid K I
>Priority: Major
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws 
> an exception.
>  
> 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.171 seconds)
> 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
> +--------------------------------------------------------------+
> | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
> +--------------------------------------------------------------+
> +--------------------------------------------------------------+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
> Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
> column whose name contains commas in Hive metastore. Table: `default`.`b`; 
> Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2018-03-31 Thread Shahid K I (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahid K I updated SPARK-23837:
---
Description: 

When a Spark-generated alias name contains a comma, the Hive metastore throws an 
exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.171 seconds)

0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--------------------------------------------------------------+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--------------------------------------------------------------+
+--------------------------------------------------------------+
No rows selected (0.168 seconds)

0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive metastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)



  was:
For spark generated alias name contains comma, Hive metastore throws exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 
decimal(18,5));
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.171 seconds)
0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+---+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+---+
+---+
No rows selected (0.168 seconds)
0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;

Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive metastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); 
(state=,code=0)




> Create table as select gives exception if the spark generated alias name 
> contains comma
> ---
>
> Key: SPARK-23837
> URL: https://issues.apache.org/jira/browse/SPARK-23837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shahid K I
>Priority: Major
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws 
> an exception.
> 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.171 seconds)
> 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
> +--------------------------------------------------------------+
> | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
> +--------------------------------------------------------------+
> +--------------------------------------------------------------+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
> Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
> column whose name contains commas in Hive metastore. Table: `default`.`b`; 
> Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2018-03-31 Thread Shahid K I (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahid K I updated SPARK-23837:
---
Description: 
When a Spark-generated alias name contains a comma, the Hive metastore throws an 
exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.171 seconds)

0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--------------------------------------------------------------+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--------------------------------------------------------------+
+--------------------------------------------------------------+
No rows selected (0.168 seconds)

0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive metastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)



  was:
For spark generated alias name contains comma, Hive metastore throws exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 
decimal(18,5));
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.171 seconds)
0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--+--+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--+--+
+--+--+
No rows selected (0.168 seconds)
0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
*Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive me
tastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); 
(state=,code=0)
*

!image-2018-03-31-19-57-38-496.png!


> Create table as select gives exception if the spark generated alias name 
> contains comma
> ---
>
> Key: SPARK-23837
> URL: https://issues.apache.org/jira/browse/SPARK-23837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shahid K I
>Priority: Major
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws 
> an exception.
> 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.171 seconds)
> 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
> +--------------------------------------------------------------+
> | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
> +--------------------------------------------------------------+
> +--------------------------------------------------------------+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
> Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
> column whose name contains commas in Hive metastore. Table: `default`.`b`; 
> Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2018-03-31 Thread Shahid K I (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahid K I updated SPARK-23837:
---
Description: 
When a Spark-generated alias name contains a comma, the Hive metastore throws an 
exception.

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.171 seconds)

0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--------------------------------------------------------------+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--------------------------------------------------------------+
+--------------------------------------------------------------+
No rows selected (0.168 seconds)

0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
column whose name contains commas in Hive metastore. Table: `default`.`b`; 
Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)

!image-2018-03-31-19-57-38-496.png!

  was:
For spark generated alias name contains comma, Hive metastore throws exception.
!image-2018-03-31-19-57-38-496.png!


> Create table as select gives exception if the spark generated alias name 
> contains comma
> ---
>
> Key: SPARK-23837
> URL: https://issues.apache.org/jira/browse/SPARK-23837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shahid K I
>Priority: Major
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws 
> an exception.
> 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.171 seconds)
> 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
> +--------------------------------------------------------------+
> | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
> +--------------------------------------------------------------+
> +--------------------------------------------------------------+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
> Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
> column whose name contains commas in Hive metastore. Table: `default`.`b`; 
> Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)
> !image-2018-03-31-19-57-38-496.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23661) Implement treeAggregate on Dataset API

2018-03-31 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421348#comment-16421348
 ] 

Liang-Chi Hsieh commented on SPARK-23661:
-

For the implementation of {{Dataset.treeAggregate}}, I'm wondering whether we need to 
support SQL tree aggregation for all cases. For example, {{RDD.treeAggregate}} 
can be seen as grouping without keys; this is the case where tree aggregation can 
help. For grouping by keys, I'm not sure it really performs much better 
than non-tree aggregation.

cc [~cloud_fan]
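
For context, a minimal sketch of the existing RDD-side API being discussed: an 
aggregation over the whole dataset with no grouping keys, combining partial results 
in a shallow tree (the data and depth below are arbitrary):

{code:scala}
// treeAggregate merges per-partition results in a multi-level tree
// (depth 2 here) instead of pulling every partial result to the driver.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 100)
val total = rdd.treeAggregate(0L)(
  (acc, x) => acc + x,   // seqOp: fold values within a partition
  (a, b) => a + b,       // combOp: merge partial sums
  2)                     // depth of the aggregation tree
{code}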

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-23661
> URL: https://issues.apache.org/jira/browse/SPARK-23661
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Many algorithms in MLlib have still not migrated their internal computing 
> workload from {{RDD}} to {{DataFrame}}. {{treeAggregate}} is one of the obstacles 
> we need to address in order to complete that migration.
> This ticket is opened to provide {{treeAggregate}} on the Dataset API. For now 
> this should be a private API used by the ML component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23835) When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1

2018-03-31 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421346#comment-16421346
 ] 

Liang-Chi Hsieh commented on SPARK-23835:
-

What would be the better behavior here?
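
(For reference, a hedged sketch of how callers can keep the null observable today, 
using the {{df}} from the snippet quoted below: decode the nullable column as 
{{Option[Double]}} rather than a primitive {{Double}}.)

{code:scala}
// Sketch only: with Option[Double] the null survives as None instead of
// being silently replaced by a sentinel value such as -1.0.
val data2 = df.as[(Option[Double], Double)]
data2.collect().foreach(println)   // ..., (Some(NaN),5.0), (None,6.0)
{code}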

> When Dataset.as converts column from nullable to non-nullable type, null 
> Doubles are converted silently to -1
> -
>
> Key: SPARK-23835
> URL: https://issues.apache.org/jira/browse/SPARK-23835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> I constructed a DataFrame with a nullable java.lang.Double column (and an 
> extra Double column).  I then converted it to a Dataset using ```as[(Double, 
> Double)]```.  When the Dataset is shown, it has a null.  When it is collected 
> and printed, the null is silently converted to a -1.
> Code snippet to reproduce this:
> {code}
> val localSpark = spark
> import localSpark.implicits._
> val df = Seq[(java.lang.Double, Double)](
>   (1.0, 2.0),
>   (3.0, 4.0),
>   (Double.NaN, 5.0),
>   (null, 6.0)
> ).toDF("a", "b")
> df.show()  // OUTPUT 1: has null
> df.printSchema()
> val data = df.as[(Double, Double)]
> data.show()  // OUTPUT 2: has null
> data.collect().foreach(println)  // OUTPUT 3: has -1
> {code}
> OUTPUT 1 and 2:
> {code}
> ++---+
> |   a|  b|
> ++---+
> | 1.0|2.0|
> | 3.0|4.0|
> | NaN|5.0|
> |null|6.0|
> ++---+
> {code}
> OUTPUT 3:
> {code}
> (1.0,2.0)
> (3.0,4.0)
> (NaN,5.0)
> (-1.0,6.0)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2018-03-31 Thread Shahid K I (JIRA)
Shahid K I created SPARK-23837:
--

 Summary: Create table as select gives exception if the spark 
generated alias name contains comma
 Key: SPARK-23837
 URL: https://issues.apache.org/jira/browse/SPARK-23837
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.1
Reporter: Shahid K I


When a Spark-generated alias name contains a comma, the Hive metastore throws an exception.
!image-2018-03-31-19-57-38-496.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC

2018-03-31 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421215#comment-16421215
 ] 

Huaxin Gao commented on SPARK-19826:


I coded the PIC Python API based on the changes in SPARK-15784 (Add Power 
Iteration Clustering to spark.ml, [https://github.com/apache/spark/pull/15770]). 
I will submit a PR once PR 15770 is merged.

> spark.ml Python API for PIC
> ---
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23836) Support returning StructType & MapType in Arrow's "scalar" UDFS (or similar)

2018-03-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421191#comment-16421191
 ] 

Hyukjin Kwon commented on SPARK-23836:
--

WDYT about moving this under SPARK-21187?

> Support returning StructType & MapType in Arrow's "scalar" UDFS (or similar)
> 
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Currently, not all of the supported types can be returned from the scalar 
> pandas UDF type. This means that if someone wants to return a struct type from a 
> map operation right now, they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org