[jira] [Created] (SPARK-21920) DataFrame Fail To Find The Column Name
abhijit nag created SPARK-21920: --- Summary: DataFrame Fail To Find The Column Name Key: SPARK-21920 URL: https://issues.apache.org/jira/browse/SPARK-21920 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.6.0 Reporter: abhijit nag Priority: Critical I am getting an issue like "sql.AnalysisException: cannot resolve column_name". I wrote a simple query as below: [DataFrame df = df1.join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT"))).select(df2.col("MERCH_ID"), df1.col("MERCHANT"));] Exception found: resolved attribute(s) MERCH_ID#738 missing from MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project [MERCH_ID#738,MERCHANT#737]; The problem is solved by the following code: DataFrame df = df1.alias("df1").join(df2.alias("df2"), functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT"))).select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT")); This kind of issue appears rarely, but I want to know the root cause of this problem. Is it a bug in Spark 1.6 or something else? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
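For illustration, the alias-based workaround described in the report can be written as the following minimal Scala sketch. The DataFrames df1/df2 and the column names are taken from the report; the snippet itself is illustrative, not the reporter's exact code.
{code:scala}
import org.apache.spark.sql.functions.col

// df1 and df2 are assumed to share lineage (df2 derived from df1), which is what
// makes unqualified column references ambiguous after the self-join.
val joined = df1.alias("a")
  .join(df2.alias("b"), col("a.MERCHANT") === col("b.MERCHANT"))
  // Qualifying columns by alias tells the analyzer which side of the join each
  // attribute resolves against, avoiding the "resolved attribute(s) ... missing" error.
  .select(col("b.MERCH_ID"), col("b.MERCHANT"))
{code}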
[jira] [Updated] (SPARK-21918) HiveClient shouldn't share Hive object between different thread
[ https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Liu, updated SPARK-21918: Description: I'm testing the Spark thrift server and found that all the DDL statements are run by user hive even if hive.server2.enable.doAs=true. The root cause is that the Hive object is shared between different threads in HiveClientImpl: {code:java} private def client: Hive = { if (clientLoader.cachedHive != null) { clientLoader.cachedHive.asInstanceOf[Hive] } else { val c = Hive.get(conf) clientLoader.cachedHive = c c } } {code} But in impersonation mode, we should only share the Hive object inside a thread so that the metastore client in Hive can be associated with the right user. We can fix this by passing the Hive object of the parent thread to the child thread when running SQL. I already have an initial patch for review and I'm glad to work on it if anyone could assign it to me. was: I'm testing the Spark thrift server and found that all the DDL statements are run by user hive even if hive.server2.enable.doAs=true. The root cause is that the Hive object is shared between different threads in HiveClientImpl: {code:java} private def client: Hive = { if (clientLoader.cachedHive != null) { clientLoader.cachedHive.asInstanceOf[Hive] } else { val c = Hive.get(conf) clientLoader.cachedHive = c c } } {code} But in impersonation mode, we should only share the Hive object inside a thread. We can fix this by passing the Hive object of the current thread to the new thread when running SQL. I already have an initial patch for review and I'm glad to work on it if anyone could assign it to me. > HiveClient shouldn't share Hive object between different thread > --- > > Key: SPARK-21918 > URL: https://issues.apache.org/jira/browse/SPARK-21918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hu Liu, > > I'm testing the Spark thrift server and found that all the DDL statements are > run by user hive even if hive.server2.enable.doAs=true. > The root cause is that the Hive object is shared between different threads in > HiveClientImpl: > {code:java} > private def client: Hive = { > if (clientLoader.cachedHive != null) { > clientLoader.cachedHive.asInstanceOf[Hive] > } else { > val c = Hive.get(conf) > clientLoader.cachedHive = c > c > } > } > {code} > But in impersonation mode, we should only share the Hive object inside a > thread so that the metastore client in Hive can be associated with the right > user. > We can fix this by passing the Hive object of the parent thread to the child > thread when running SQL. > I already have an initial patch for review and I'm glad to work on it if > anyone could assign it to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
Ashish Chopra created SPARK-21919: - Summary: inconsistent behavior of AFTsurvivalRegression algorithm Key: SPARK-21919 URL: https://issues.apache.org/jira/browse/SPARK-21919 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.2.0 Environment: Spark Version: 2.2.0 Cluster setup: Standalone single node Python version: 3.5.2 Reporter: Ashish Chopra I took the example directly from the Spark ML documentation. {code} training = spark.createDataFrame([ (1.218, 1.0, Vectors.dense(1.560, -0.605)), (2.949, 0.0, Vectors.dense(0.346, 2.158)), (3.627, 0.0, Vectors.dense(1.380, 0.231)), (0.273, 1.0, Vectors.dense(0.520, 1.151)), (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"]) quantileProbabilities = [0.3, 0.6] aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities, quantilesCol="quantiles") #aft = AFTSurvivalRegression() model = aft.fit(training) # Print the coefficients, intercept and scale parameter for AFT survival regression print("Coefficients: " + str(model.coefficients)) print("Intercept: " + str(model.intercept)) print("Scale: " + str(model.scale)) model.transform(training).show(truncate=False) {code} The result is: Coefficients: [-0.496304411053,0.198452172529] Intercept: 2.6380898963056327 Scale: 1.5472363533632303 ||label||censor||features ||prediction || quantiles || |1.218|1.0 |[1.56,-0.605] |5.718985621018951 | [1.160322990805951,4.99546058340675]| |2.949|0.0 |[0.346,2.158] |18.07678210850554 |[3.66759199449632,15.789837303662042]| |3.627|0.0 |[1.38,0.231] |7.381908879359964 |[1.4977129086101573,6.4480027195054905]| |0.273|1.0 |[0.52,1.151] |13.577717814884505|[2.754778414791513,11.859962351993202]| |4.199|0.0 |[0.795,-0.226]|9.013087597344805 |[1.828662187733188,7.8728164067854856]| But if we change all the label values to label + 20, as in: {code} training = spark.createDataFrame([ (21.218, 1.0, Vectors.dense(1.560, -0.605)), (22.949, 0.0, Vectors.dense(0.346, 2.158)), (23.627, 0.0, Vectors.dense(1.380, 0.231)), (20.273, 1.0, Vectors.dense(0.520, 1.151)), (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"]) quantileProbabilities = [0.3, 0.6] aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities, quantilesCol="quantiles") #aft = AFTSurvivalRegression() model = aft.fit(training) # Print the coefficients, intercept and scale parameter for AFT survival regression print("Coefficients: " + str(model.coefficients)) print("Intercept: " + str(model.intercept)) print("Scale: " + str(model.scale)) model.transform(training).show(truncate=False) {code} the result changes to: Coefficients: [23.9932020748,3.18105314757] Intercept: 7.35052273751137 Scale: 7698609960.724161 ||label ||censor||features ||prediction ||quantiles|| |21.218|1.0 |[1.56,-0.605] |4.0912442688237169E18|[0.0,0.0]| |22.949|0.0 |[0.346,2.158] |6.011158613411288E9 |[0.0,0.0]| |23.627|0.0 |[1.38,0.231] |7.7835948690311181E17|[0.0,0.0]| |20.273|1.0 |[0.52,1.151] |1.5880852723124176E10|[0.0,0.0]| |24.199|0.0 |[0.795,-0.226]|1.4590190884193677E11|[0.0,0.0]| Can someone please explain this exponential blow-up in the prediction? As per my understanding, the prediction in AFT is a prediction of the time when the failure event will occur; I am not able to understand why it changes exponentially with the value of the label. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
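A note on the blow-up (my own illustration, not part of the report): AFT survival regression is a log-linear model, so the prediction column is exp(intercept + coefficients . features). Once the fitted intercept and coefficients become large, the prediction grows exponentially. The first row of each run above can be reproduced with plain Scala arithmetic:
{code:scala}
// Log-linear AFT prediction: exp(intercept + coefficients . features).
// First run (labels 1.218 ... 4.199), first row with features [1.56, -0.605]:
val pred1 = math.exp(2.6380898963056327 + (-0.496304411053 * 1.56) + (0.198452172529 * -0.605))
// pred1 ~= 5.719, matching the reported 5.718985621018951

// Second run (labels shifted by +20), same row with the new fit:
val pred2 = math.exp(7.35052273751137 + (23.9932020748 * 1.56) + (3.18105314757 * -0.605))
// pred2 ~= 4.09e18 -- the much larger fitted coefficients are exponentiated,
// which is why the predictions (and the degenerate [0.0, 0.0] quantiles) explode.
{code}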
[jira] [Created] (SPARK-21918) HiveClient shouldn't share Hive object between different thread
Hu Liu, created SPARK-21918: --- Summary: HiveClient shouldn't share Hive object between different thread Key: SPARK-21918 URL: https://issues.apache.org/jira/browse/SPARK-21918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Hu Liu, I'm testing the Spark thrift server and found that all the DDL statements are run by user hive even if hive.server2.enable.doAs=true. The root cause is that the Hive object is shared between different threads in HiveClientImpl: {code:java} private def client: Hive = { if (clientLoader.cachedHive != null) { clientLoader.cachedHive.asInstanceOf[Hive] } else { val c = Hive.get(conf) clientLoader.cachedHive = c c } } {code} But in impersonation mode, we should only share the Hive object inside a thread. We can fix this by passing the Hive object of the current thread to the new thread when running SQL. I already have an initial patch for review and I'm glad to work on it if anyone could assign it to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
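The per-thread sharing described above could look roughly like the sketch below. This is only an outline of the caching pattern, not the actual patch; `conf` is the HiveConf already held by HiveClientImpl, and the field name is made up.
{code:scala}
import org.apache.hadoop.hive.ql.metadata.Hive

// Illustrative per-thread cache: each thread creates and reuses its own Hive
// instance, so the metastore client it wraps stays bound to the user who owns
// that thread when doAs/impersonation is enabled.
private val threadLocalHive = new ThreadLocal[Hive] {
  override def initialValue(): Hive = Hive.get(conf)
}

private def client: Hive = threadLocalHive.get()
{code}
Child threads spawned while running a statement would still need the parent's handle handed over explicitly (or an InheritableThreadLocal), which is the part the proposed patch addresses.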
[jira] [Commented] (SPARK-21917) Remote http(s) resources is not supported in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153141#comment-16153141 ] Saisai Shao commented on SPARK-21917: - I'm inclined to choose option 1; the only overhead is resource re-uploading, the fix is restricted to SparkSubmit, and all other code would work transparently. What's your opinion [~tgraves] [~vanzin]? > Remote http(s) resources is not supported in YARN mode > -- > > Key: SPARK-21917 > URL: https://issues.apache.org/jira/browse/SPARK-21917 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > In the current Spark, when submitting an application on YARN with remote > resources {{./bin/spark-shell --jars > http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar > --master yarn-client -v}}, Spark will fail with: > {noformat} > java.io.IOException: No FileSystem for scheme: http > at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354) > at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599) > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598) > at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848) > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173) > {noformat} > This is because {{YARN#client}} assumes resources must be on a Hadoop-compatible > FS, and also in the NM > (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) > it will only use a Hadoop-compatible FS to download resources. So this makes > Spark on YARN fail to support remote http(s) resources. > To solve this problem, there might be several options: > * Download remote http(s) resources to local storage and add the downloaded > resources to the dist cache. The downside of this option is that remote resources > will be uploaded again unnecessarily. > * Filter remote http(s) resources and add them with spark.jars or > spark.files, to leverage Spark's internal fileserver to distribute remote > http(s) resources. The problem with this solution is that some resources which > need to be available before the application starts may not work. > * Leverage Hadoop's support for an http(s) file system > (https://issues.apache.org/jira/browse/HADOOP-14383). This only works in > Hadoop 2.9+, and I think even if we implement a similar one in Spark it would > not work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21917) Remote http(s) resources is not supported in YARN mode
Saisai Shao created SPARK-21917: --- Summary: Remote http(s) resources is not supported in YARN mode Key: SPARK-21917 URL: https://issues.apache.org/jira/browse/SPARK-21917 Project: Spark Issue Type: Bug Components: Spark Submit, YARN Affects Versions: 2.2.0 Reporter: Saisai Shao Priority: Minor In the current Spark, when submitting an application on YARN with remote resources {{./bin/spark-shell --jars http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar --master yarn-client -v}}, Spark will fail with: {noformat} java.io.IOException: No FileSystem for scheme: http at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354) at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173) {noformat} This is because {{YARN#client}} assumes resources must be on a Hadoop-compatible FS, and also in the NM (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) it will only use a Hadoop-compatible FS to download resources. So this makes Spark on YARN fail to support remote http(s) resources. To solve this problem, there might be several options: * Download remote http(s) resources to local storage and add the downloaded resources to the dist cache. The downside of this option is that remote resources will be uploaded again unnecessarily. * Filter remote http(s) resources and add them with spark.jars or spark.files, to leverage Spark's internal fileserver to distribute remote http(s) resources. The problem with this solution is that some resources which need to be available before the application starts may not work. * Leverage Hadoop's support for an http(s) file system (https://issues.apache.org/jira/browse/HADOOP-14383). This only works in Hadoop 2.9+, and I think even if we implement a similar one in Spark it would not work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
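A rough sketch of what option 1 could look like (names are invented for illustration; a real change would live in SparkSubmit and reuse its existing download utilities):
{code:scala}
import java.io.File
import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

// Fetch an http(s) resource into a local directory so that the YARN client can
// upload it to a Hadoop-compatible FS like any other local jar; the cost is that
// the resource is transferred twice (downloaded here, then uploaded to the dist cache).
def downloadToLocal(uri: String, targetDir: File): String = {
  val target = new File(targetDir, uri.split("/").last)
  val in = new URL(uri).openStream()
  try {
    Files.copy(in, target.toPath, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }
  target.toURI.toString
}
{code}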
[jira] [Assigned] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
[ https://issues.apache.org/jira/browse/SPARK-21916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21916: Assignee: (was: Apache Spark) > Set isolationOn=true when create client to remote hive metastore > > > Key: SPARK-21916 > URL: https://issues.apache.org/jira/browse/SPARK-21916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > In current code, we set {{isolationOn=!isCliSessionState()}} when create hive > client for metadata. However conf of {{CliSessionState}} points to local > dummy > metastore(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). > Using {{CliSessionState}}, we fail to get metadata from remote hive > metastore. We can always set {{isolationOn=true}} when create hive client for > metadata -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
[ https://issues.apache.org/jira/browse/SPARK-21916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153126#comment-16153126 ] Apache Spark commented on SPARK-21916: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/19127 > Set isolationOn=true when create client to remote hive metastore > > > Key: SPARK-21916 > URL: https://issues.apache.org/jira/browse/SPARK-21916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > In current code, we set {{isolationOn=!isCliSessionState()}} when create hive > client for metadata. However conf of {{CliSessionState}} points to local > dummy > metastore(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). > Using {{CliSessionState}}, we fail to get metadata from remote hive > metastore. We can always set {{isolationOn=true}} when create hive client for > metadata -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
[ https://issues.apache.org/jira/browse/SPARK-21916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21916: Assignee: Apache Spark > Set isolationOn=true when create client to remote hive metastore > > > Key: SPARK-21916 > URL: https://issues.apache.org/jira/browse/SPARK-21916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: Apache Spark > > In current code, we set {{isolationOn=!isCliSessionState()}} when create hive > client for metadata. However conf of {{CliSessionState}} points to local > dummy > metastore(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). > Using {{CliSessionState}}, we fail to get metadata from remote hive > metastore. We can always set {{isolationOn=true}} when create hive client for > metadata -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
jin xing created SPARK-21916: Summary: Set isolationOn=true when create client to remote hive metastore Key: SPARK-21916 URL: https://issues.apache.org/jira/browse/SPARK-21916 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jin xing In the current code, we set {{isolationOn=!isCliSessionState()}} when creating the Hive client for metadata. However, the conf of {{CliSessionState}} points to a local dummy metastore (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). Using {{CliSessionState}}, we fail to get metadata from the remote Hive metastore. We can always set {{isolationOn=true}} when creating the Hive client for metadata. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153084#comment-16153084 ] Apache Spark commented on SPARK-21915: -- User 'marktab' has created a pull request for this issue: https://github.com/apache/spark/pull/19126 > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21915: Assignee: Apache Spark > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Assignee: Apache Spark >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Tabladillo updated SPARK-21915: Description: Error in PySpark example code [https://github.com/apache/spark/blob/master/examples/src/main/python/ml/estimator_transformer_param_example.py] The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. This code has been tested in Python, and returns values consistent with Scala Proposing to call the lr variable instead of model1 or model2 This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55, Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30, Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True} was: The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. 
This code has been tested in Python, and returns values consistent with Scala Proposing to call the lr variable instead of model1 or model2 This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
[jira] [Assigned] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21915: Assignee: (was: Apache Spark) > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
Mark Tabladillo created SPARK-21915: --- Summary: Model 1 and Model 2 ParamMaps Missing Key: SPARK-21915 URL: https://issues.apache.org/jira/browse/SPARK-21915 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0 Reporter: Mark Tabladillo Priority: Minor The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. This code has been tested in Python, and returns values consistent with Scala Proposing to call the lr variable instead of model1 or model2 This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55, Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30, Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
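For comparison, the Scala side of the same example relies on `parent` of a fitted Model returning the Estimator that produced it. The sketch below is paraphrased from the documented estimator/transformer/param example rather than copied verbatim, and `training` stands for that example's input DataFrame:
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)
// `model1.parent` is the LogisticRegression estimator `lr`, so extractParamMap
// prints the parameters the fit actually used -- the behaviour the PySpark
// example is meant to mirror by printing lr's params instead.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)
{code}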
[jira] [Commented] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153078#comment-16153078 ] Apache Spark commented on SPARK-19126: -- User 'marktab' has created a pull request for this issue: https://github.com/apache/spark/pull/19126 > Join Documentation Improvements > --- > > Key: SPARK-19126 > URL: https://issues.apache.org/jira/browse/SPARK-19126 > Project: Spark > Issue Type: Improvement >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > - Some join types are missing (no mention of anti join) > - Joins are labelled inconsistently both within each language and between > languages. > - Update according to new join spec for `crossJoin` > Pull request coming... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21914) Running examples as tests in SQL builtin function documentation
[ https://issues.apache.org/jira/browse/SPARK-21914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153011#comment-16153011 ] Hyukjin Kwon commented on SPARK-21914: -- [~rxin], would you mind if I ask whether you like this idea (running examples in SQL doc as tests) ? > Running examples as tests in SQL builtin function documentation > --- > > Key: SPARK-21914 > URL: https://issues.apache.org/jira/browse/SPARK-21914 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > It looks we have added many examples in {{ExpressionDescription}} for builtin > functions. > Actually, if I have seen correctly, we have fixed many examples so far in > some minor PRs and sometimes require to add the examples as tests sql and > golden files. > As we have formatted examples in {{ExpressionDescription.examples}} - > https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50, > and we have `SQLQueryTestSuite`, I think we could run the examples as tests > like Python's doctests. > Rough way I am thinking: > 1. Loads the example in {{ExpressionDescription}}. > 2. identify queries by {{>}}. > 3. identify the rest of them as the results. > 4. run the examples by reusing {{SQLQueryTestSuite}} if possible. > 5. compare the output by reusing {{SQLQueryTestSuite}} if possible. > Advantages of doing this I could think for now: > - Reduce the number of PRs to fix the examples > - De-duplicate the test cases that should be added into sql and golden files. > - Correct documentation with correct examples. > - Reduce reviewing costs for documentation fix PRs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21914) Running examples as tests in SQL builtin function documentation
Hyukjin Kwon created SPARK-21914: Summary: Running examples as tests in SQL builtin function documentation Key: SPARK-21914 URL: https://issues.apache.org/jira/browse/SPARK-21914 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.3.0 Reporter: Hyukjin Kwon It looks like we have added many examples in {{ExpressionDescription}} for builtin functions. Actually, if I have seen correctly, we have fixed many examples so far in some minor PRs, and sometimes we need to add the examples as tests to sql and golden files. As we have formatted examples in {{ExpressionDescription.examples}} - https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50, and we have `SQLQueryTestSuite`, I think we could run the examples as tests like Python's doctests. A rough way I am thinking of: 1. Load the examples in {{ExpressionDescription}}. 2. Identify queries by {{>}}. 3. Identify the rest of them as the results. 4. Run the examples by reusing {{SQLQueryTestSuite}} if possible. 5. Compare the output by reusing {{SQLQueryTestSuite}} if possible. Advantages of doing this that I can think of for now: - Reduce the number of PRs to fix the examples. - De-duplicate the test cases that should be added into sql and golden files. - Correct documentation with correct examples. - Reduce reviewing costs for documentation fix PRs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
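A rough sketch of steps 2 and 3 above (splitting an {{ExpressionDescription.examples}} string into query/result pairs); purely illustrative, and the real integration would reuse {{SQLQueryTestSuite}}'s existing comparison machinery:
{code:scala}
// Queries in the documented example format start with "> "; the following
// non-query lines are treated as the expected output of that query.
def splitExamples(examples: String): Seq[(String, String)] = {
  val result = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
  var currentQuery: Option[String] = None
  val currentOutput = new StringBuilder

  def flush(): Unit = {
    currentQuery.foreach(q => result += q -> currentOutput.toString.trim)
    currentQuery = None
    currentOutput.clear()
  }

  examples.split("\n").map(_.trim).filter(_.nonEmpty).foreach {
    case line if line.startsWith(">") =>
      flush()
      currentQuery = Some(line.stripPrefix(">").trim)
    case line =>
      currentOutput.append(line).append('\n')
  }
  flush()
  result
}
{code}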
[jira] [Commented] (SPARK-21913) `withDatabase` should drop database with CASCADE
[ https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152977#comment-16152977 ] Apache Spark commented on SPARK-21913: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19125 > `withDatabase` should drop database with CASCADE > > > Key: SPARK-21913 > URL: https://issues.apache.org/jira/browse/SPARK-21913 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, it fails if the database is not empty. It would be great if we > drop cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21913) `withDatabase` should drop database with CASCADE
[ https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21913: Assignee: (was: Apache Spark) > `withDatabase` should drop database with CASCADE > > > Key: SPARK-21913 > URL: https://issues.apache.org/jira/browse/SPARK-21913 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, it fails if the database is not empty. It would be great if we > drop cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21913) `withDatabase` should drop database with CASCADE
[ https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21913: Assignee: Apache Spark > `withDatabase` should drop database with CASCADE > > > Key: SPARK-21913 > URL: https://issues.apache.org/jira/browse/SPARK-21913 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Currently, it fails if the database is not empty. It would be great if we > drop cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21913) `withDatabase` should drop database with CASCADE
Dongjoon Hyun created SPARK-21913: - Summary: `withDatabase` should drop database with CASCADE Key: SPARK-21913 URL: https://issues.apache.org/jira/browse/SPARK-21913 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Dongjoon Hyun Priority: Minor Currently, it fails if the database is not empty. It would be great if we dropped it cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
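A minimal sketch of what the test helper could do (method shape assumed for illustration, not the actual SQLTestUtils code; `spark` is the test's SparkSession):
{code:scala}
// Dropping with CASCADE makes cleanup succeed even when a test has left
// tables behind in the database.
def withDatabase(dbNames: String*)(f: => Unit): Unit = {
  try f finally {
    dbNames.foreach(name => spark.sql(s"DROP DATABASE IF EXISTS $name CASCADE"))
  }
}
{code}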
[jira] [Comment Edited] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152973#comment-16152973 ] Felix Cheung edited comment on SPARK-21727 at 9/4/17 11:08 PM: --- precisely. as far as I can tell, everything should "just work" if we return "array" from `getSerdeType()` for this case when length > 1. was (Author: felixcheung): precisely. as far as I can tell, everything should "just work" if we return `array` from `getSerdeType()` for this case when length > 1. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152973#comment-16152973 ] Felix Cheung commented on SPARK-21727: -- precisely. as far as I can tell, everything should "just work" if we return `array` from `getSerdeType()` for this case when length > 1. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table
[ https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152955#comment-16152955 ] Marco Gaido commented on SPARK-21905: - This is likely to be caused by a bug in the Magellan package. It expects to receive an InternalRow to deserialize but in this case it doesn't happen. So it should be fixed there. > ClassCastException when call sqlContext.sql on temp table > - > > Key: SPARK-21905 > URL: https://issues.apache.org/jira/browse/SPARK-21905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: bluejoe > > {code:java} > val schema = StructType(List( > StructField("name", DataTypes.StringType, true), > StructField("location", new PointUDT, true))) > val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), > 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); > val dataFrame = sqlContext.createDataFrame(rowRdd, schema) > dataFrame.createOrReplaceTempView("person"); > sqlContext.sql("SELECT * FROM person").foreach(println(_)); > {code} > the last statement throws exception: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) > ... 18 more > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-21418. --- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.3.0 2.2.1 > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Assignee: Sean Owen >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.O
[jira] [Assigned] (SPARK-21912) Creating ORC datasource table should check invalid column names
[ https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21912: Assignee: Apache Spark > Creating ORC datasource table should check invalid column names > --- > > Key: SPARK-21912 > URL: https://issues.apache.org/jira/browse/SPARK-21912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > Currently, users meet job abortions while creating ORC data source tables > with invalid column names. We had better prevent this by raising > AnalysisException like Paquet data source tables. > {code} > scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") > 17/09/04 13:28:21 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: Error: : expected at the position 8 of > 'struct' but ' ' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) > ... > 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete > file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 > 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. > 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > org.apache.spark.SparkException: Task failed while writing rows. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21912) Creating ORC datasource table should check invalid column names
[ https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21912: Assignee: (was: Apache Spark) > Creating ORC datasource table should check invalid column names > --- > > Key: SPARK-21912 > URL: https://issues.apache.org/jira/browse/SPARK-21912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > Currently, users meet job abortions while creating ORC data source tables > with invalid column names. We had better prevent this by raising > AnalysisException like Paquet data source tables. > {code} > scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") > 17/09/04 13:28:21 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: Error: : expected at the position 8 of > 'struct' but ' ' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) > ... > 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete > file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 > 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. > 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > org.apache.spark.SparkException: Task failed while writing rows. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21912) Creating ORC datasource table should check invalid column names
[ https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152931#comment-16152931 ] Apache Spark commented on SPARK-21912: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19124 > Creating ORC datasource table should check invalid column names > --- > > Key: SPARK-21912 > URL: https://issues.apache.org/jira/browse/SPARK-21912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > Currently, users meet job abortions while creating ORC data source tables > with invalid column names. We had better prevent this by raising > AnalysisException like Paquet data source tables. > {code} > scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") > 17/09/04 13:28:21 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: Error: : expected at the position 8 of > 'struct' but ' ' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) > ... > 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete > file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 > 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. > 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > org.apache.spark.SparkException: Task failed while writing rows. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21912) Creating ORC datasource table should check invalid column names
Dongjoon Hyun created SPARK-21912: - Summary: Creating ORC datasource table should check invalid column names Key: SPARK-21912 URL: https://issues.apache.org/jira/browse/SPARK-21912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Dongjoon Hyun Currently, users meet job abortions while creating ORC data source tables with invalid column names. We should prevent this by raising an AnalysisException, as Parquet data source tables already do. {code} scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:28:21 ERROR Utils: Aborting task java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but ' ' is found. at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) ... 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkException: Task failed while writing rows. {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
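The Parquet path referenced in the description rejects such names up front. The sketch below shows the kind of pre-write check being requested; the helper name, its placement, and the exact character set are assumptions (the actual change is in the linked pull request), and a plain exception is thrown so the snippet compiles outside Spark, where the real code would raise AnalysisException.
{code}
import org.apache.spark.sql.types.StructType

// Hypothetical validation run before any write task is launched: reject field names the
// Hive/ORC type parser cannot round-trip, instead of failing later inside the executor.
def checkOrcFieldNames(schema: StructType): Unit = {
  val forbidden = " ,;{}()\n\t="
  schema.fieldNames.foreach { name =>
    if (name.exists(c => forbidden.contains(c))) {
      throw new IllegalArgumentException(
        s"Column name '$name' contains invalid character(s); please use an alias to rename it.")
    }
  }
}
{code}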
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152915#comment-16152915 ] Matei Zaharia commented on SPARK-21866: --- Just to chime in on this, I've also seen feedback that the deep learning libraries for Spark are too fragmented: there are too many of them, and people don't know where to start. This standard representation would at least give them a clear way to interoperate. It would let people write separate libraries for image processing, data augmentation and then training for example. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. 
Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4
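As background on the OpenCV convention the excerpt refers to, OpenCV packs the element depth and the channel count into a single integer type code. The small sketch below follows OpenCV's CV_MAKETYPE rule; it is context for the "mode" field, not part of the proposed Spark API.
{code}
// OpenCV type code: depth + ((channels - 1) << 3)
val CV_8U = 0  // depth code for unsigned 8-bit elements

def cvMakeType(depth: Int, channels: Int): Int = depth + ((channels - 1) << 3)

val CV_8UC3 = cvMakeType(CV_8U, 3)  // 16: "3 channel unsigned bytes", as described above
val CV_8UC4 = cvMakeType(CV_8U, 4)  // 24: the BGRA layout mentioned at the end of the excerpt
{code}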
[jira] [Commented] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
[ https://issues.apache.org/jira/browse/SPARK-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152914#comment-16152914 ] Apache Spark commented on SPARK-21882: -- User 'awarrior' has created a pull request for this issue: https://github.com/apache/spark/pull/19115 > OutputMetrics doesn't count written bytes correctly in the > saveAsHadoopDataset function > --- > > Key: SPARK-21882 > URL: https://issues.apache.org/jira/browse/SPARK-21882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.2.0 >Reporter: linxiaojun >Priority: Minor > Attachments: SPARK-21882.patch > > > The first job called from saveAsHadoopDataset, running in each executor, does > not calculate the writtenBytes of OutputMetrics correctly (writtenBytes is > 0). The reason is that we did not initialize the callback function called to > find bytes written in the right way. As usual, statisticsTable which records > statistics in a FileSystem must be initialized at the beginning (this will be > triggered when open SparkHadoopWriter). The solution for this issue is to > adjust the order of callback function initialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
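The ordering point in the description is easy to miss, so here is a minimal sketch with invented names (this is not the SparkHadoopWriter API): the bytes-written callback reads Hadoop FileSystem statistics, and those statistics only exist once the writer has opened the FileSystem, so the callback has to be created after the open, not before.
{code}
// Invented names, for illustration only.
def writePartition(
    openWriter: () => Unit,                      // opens the Hadoop writer, registering FS statistics
    newBytesWrittenCallback: () => (() => Long), // builds a reader over the registered statistics
    writeRecords: () => Unit): Long = {
  openWriter()                                   // 1. open first, so the statistics exist
  val bytesWritten = newBytesWrittenCallback()   // 2. only now can the callback see real counters
  writeRecords()
  bytesWritten()                                 // bytes actually written; stays 0 if built too early
}
{code}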
[jira] [Assigned] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21418: Assignee: Apache Spark > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Assignee: Apache Spark >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(Ob
[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152821#comment-16152821 ] Apache Spark commented on SPARK-21418: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/19123 > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdina
[jira] [Assigned] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21418: Assignee: (was: Apache Spark) > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:117
[jira] [Updated] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21418: -- Summary: NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true (was: NoSuchElementException: None.get on DataFrame.rdd) > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStrea
[jira] [Updated] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21418: -- Priority: Minor (was: Major) I think we could easily make this code a little more defensive so that this doesn't result in an error. It's just trying to check if a config exists in SparkConf and there's no particular need for this to fail. > NoSuchElementException: None.get on DataFrame.rdd > - > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream
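A minimal sketch of the "more defensive" lookup described above, assuming the redaction helper falls back from the active session's conf to the process-wide SparkEnv conf instead of calling Option.get; the helper name and exact fallback order are illustrative, not the merged patch.
{code}
import org.apache.spark.SparkEnv
import org.apache.spark.sql.SparkSession

// Illustrative only: never unwrap the thread-local active session with .get; fall back to the
// process-wide conf and finally to a default, so a serialization-time toString cannot throw.
def redactionPattern(key: String, default: String): String =
  SparkSession.getActiveSession              // None on threads with no active session
    .map(_.sparkContext.getConf)
    .orElse(Option(SparkEnv.get).map(_.conf))
    .map(_.get(key, default))
    .getOrElse(default)
{code}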
[jira] [Assigned] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21911: Assignee: Apache Spark > Parallel Model Evaluation for ML Tuning: Python > --- > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Apache Spark > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21911) Parallel Model Evaluation for ML Tuning: PySpark
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-21911: --- Summary: Parallel Model Evaluation for ML Tuning: PySpark (was: Parallel Model Evaluation for ML Tuning: Python) > Parallel Model Evaluation for ML Tuning: PySpark > > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
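For reference, a hedged sketch of the equivalent knob on the Scala side of ML tuning in 2.3.0, which this ticket mirrors for pyspark; the estimator, grid values, and the commented-out training data are placeholders.
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(4)  // evaluate up to 4 candidate models concurrently

// val model = cv.fit(trainingData)  // trainingData is a placeholder DataFrame
{code}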
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez edited comment on SPARK-17041 at 9/4/17 3:54 PM: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like the following {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the exception can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. was (Author: hangleton): I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez edited comment on SPARK-17041 at 9/4/17 3:53 PM: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. was (Author: hangleton): I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295.;}} (in fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity). > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez commented on SPARK-17041: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295.;}} (in fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity). > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
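Independent of the error-message wording discussed above, analysis-time name resolution follows the spark.sql.caseSensitive setting, so enabling it is a common workaround for headers that differ only by case. A sketch, assuming a running SparkSession named spark and reusing dfSchema and dataFile from the report:
{code}
// Workaround sketch (not the message fix discussed above): with case-sensitive analysis,
// "Output" and "output" are treated as distinct columns and the read no longer fails.
spark.conf.set("spark.sql.caseSensitive", "true")

val df = spark.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(dfSchema)  // contains both "Output" and "output", as in the report
  .csv(dataFile)
{code}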
[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152742#comment-16152742 ] Daniel Darabos commented on SPARK-21418: Sorry for the delay. I can confirm that removing {{-Dsun.io.serialization.extendedDebugInfo=true}} is the fix. We only use this flag when running unit tests, but it's very useful for debugging serialization issues. It happens often in Spark that you accidentally include something in a closure that cannot be serialized. It's hard to figure out without this flag what caused that. > NoSuchElementException: None.get on DataFrame.rdd > - > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOu
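For anyone who wants the same debugging aid in their own suite, a hedged build.sbt sketch; the flag only reaches the tests if the JVM is forked, and the exact settings depend on the build.
{code}
// build.sbt fragment: pass the serialization-debug flag to forked test JVMs.
fork in Test := true
javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true"
{code}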
[jira] [Comment Edited] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088952#comment-16088952 ] Daniel Darabos edited comment on SPARK-21418 at 9/4/17 3:49 PM: I'm on holiday without a computer through the coming week, but I'll try to dig deeper after that. I do recall that we enable a JVM flag for printing extra details on serialization errors. Now I wonder if that flag collects string forms even when no error happens. I guess I should not be surprised: if it did not, there would be no reason to ever disable this feature. That already suggests an easy workaround :). Thanks! was (Author: darabos): I'm on holiday without a computer through the coming week, but I'll try to dig deeper after that. I do recall that we enable a JVM flag for printing extra details on serialization errors. Now I wonder if that flag collects string forms even when no error happens. I guess I should not be surprised: if it did not, there would be no reason to ever disable this feature. That already suggests an easy workaround :). Thanks! On Jul 15, 2017 6:44 PM, "Kazuaki Ishizaki (JIRA)" wrote: [ https://issues.apache.org/jira/browse/SPARK-21418?page= com.atlassian.jira.plugin.system.issuetabpanels:comment- tabpanel&focusedCommentId=16088659#comment-16088659 ] Kazuaki Ishizaki commented on SPARK-21418: -- I am curious why {java.io.ObjectOutputStream.writeOrdinaryObject} calls `toString` method. Do you specify some option to run this program for JVM? following lines in a unit test for our Spark application: {{collect}} fails: serialization failed: java.util.NoSuchElementException: None.get $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec. scala:70) DataSourceScanExec.scala:54) DataSourceScanExec.scala:52) 1.apply(TraversableLike.scala:234) 1.apply(TraversableLike.scala:234) ResizableArray.scala:59) DataSourceScanExec.scala:52) DataSourceScanExec.scala:75) QueryPlan.scala:349) apache$spark$sql$execution$DataSourceScanExec$$super$verboseString( DataSourceScanExec.scala:75) class.verboseString(DataSourceScanExec.scala:60) DataSourceScanExec.scala:75) generateTreeString(TreeNode.scala:556) generateTreeString(WholeStageCodegenExec.scala:451) generateTreeString(TreeNode.scala:576) TreeNode.scala:480) TreeNode.scala:477) TreeNode.scala:474) ObjectOutputStream.java:1421) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) writeObject(List.scala:468) NativeMethodAccessorImpl.java:62) DelegatingMethodAccessorImpl.java:43) ObjectStreamClass.java:1028) ObjectOutputStream.java:1496) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) writeObject(JavaSerializer.scala:43) serialize(JavaSerializer.scala:100) DAGScheduler.scala:1003) scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) DAGScheduler.scala:874) doOnReceive(DAGScheduler.scala:1677) onReceive(DAGScheduler.scala:1669) onReceive(DAGScheduler.scala:1658) 
91fa80fe8a2480d64c430bd10f97b3d44c007bcc#diff-2a91a9a59953aa82fa132aaf45bd731bR69 from https://issues.apache.org/jira/browse/SPARK-20070. It tries to redact sensitive information from {{explain}} output. (We are not trying to explain anything here, so I doubt it is meant to be running in this case.) When it needs to access some configuration, it tries to take it from the "current" Spark session, which it just reads from a thread-local variable. We appear to be on a thread where this variable is not set. This seems like a surprising constraint on multi-threaded Spark applications. -- This message was sent by Atlassian JIRA (v6.4.14#64029) > NoSuchElementException: None.get on DataFrame.rdd > - > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails with {{java.util.NoSuchElementException: None.get}} (the quoted stack trace is truncated in this message).
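The comment above traces the {{None.get}} to the redaction helper reading the "current" SparkSession from a thread-local variable. A minimal sketch of the implied workaround, not taken from the ticket: if the action is triggered from a secondary application thread, register the session on that thread first so the thread-local lookup succeeds. The builder settings and the toy DataFrame are placeholders.

{code:java}
// Sketch only: when DataFrame.rdd.collect runs on a thread other than the one
// that created the SparkSession, the thread-local "active session" can be
// unset there. Registering the session on that thread is one possible workaround.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val worker = new Thread(new Runnable {
  override def run(): Unit = {
    // Make the session visible to the thread-local lookup used on the
    // DataSourceScanExec redaction code path.
    SparkSession.setActiveSession(spark)
    val df = spark.range(10).toDF("id")
    println(df.rdd.collect().mkString(", "))
  }
})
worker.start()
worker.join()
{code}

The other workaround hinted at in the comment is simply disabling the serialization-debugging JVM flag, so the string forms are never computed on this code path.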
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152766#comment-16152766 ] Neil McQuarrie commented on SPARK-21727: Happy to take on the change this side... (unless [~yanboliang] you had intended to?) > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152764#comment-16152764 ] Neil McQuarrie commented on SPARK-21727: Well, if class is "numeric" (or "integer", "character", etc.), then technically it is always a vector? (There are no distinct scalars in R?) We could look at length > 1? > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21911: Assignee: (was: Apache Spark) > Parallel Model Evaluation for ML Tuning: Python > --- > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152762#comment-16152762 ] Apache Spark commented on SPARK-21911: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19122 > Parallel Model Evaluation for ML Tuning: Python > --- > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala
[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-19357: --- Summary: Parallel Model Evaluation for ML Tuning: Scala (was: Parallel Model Evaluation for ML Tuning) > Parallel Model Evaluation for ML Tuning: Scala > -- > > Key: SPARK-19357 > URL: https://issues.apache.org/jira/browse/SPARK-19357 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Bryan Cutler > > This is a first step of the parent task of Optimizations for ML Pipeline > Tuning to perform model evaluation in parallel. A simple approach is to > naively evaluate with a possible parameter to control the level of > parallelism. There are some concerns with this: > * excessive caching of datasets > * what to set as the default value for level of parallelism. 1 will evaluate > all models in serial, as is done currently. Higher values could lead to > excessive caching. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
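To make the proposal in the description above concrete, here is a minimal sketch of how the parallelism knob could look on the Scala side. The {{setParallelism}} setter name is an assumption chosen for illustration, not something confirmed by this ticket; a value of 1 would keep today's serial evaluation.

{code:java}
// Illustrative sketch only: assumes the proposed knob is exposed as a
// setParallelism(...) param on the tuning estimators.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(2)  // evaluate up to 2 candidate models concurrently

// val cvModel = cv.fit(trainingDf)  // trainingDf is a placeholder DataFrame
{code}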
[jira] [Created] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
Weichen Xu created SPARK-21911: -- Summary: Parallel Model Evaluation for ML Tuning: Python Key: SPARK-21911 URL: https://issues.apache.org/jira/browse/SPARK-21911 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Weichen Xu Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez edited comment on SPARK-17041 at 9/4/17 3:53 PM: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. was (Author: hangleton): I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
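The comment above asks for a clearer error message; the report itself is about columns that differ only by case. A commonly suggested workaround, sketched below under the assumption that the ambiguity comes from case-insensitive analysis, is to turn case sensitivity back on before reading. The file path and the two-column schema are placeholders.

{code:java}
// Sketch of a possible workaround (not verified against this exact report):
// make the analyzer case sensitive so "Output" and "output" stay distinct.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")

val dfSchema = StructType(Seq(
  StructField("Output", StringType, nullable = true),
  StructField("output", StringType, nullable = true)))

val df = spark.read
  .format("csv")
  .option("header", "false")
  .schema(dfSchema)
  .csv("/path/to/dataFile.csv")  // placeholder path
{code}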
[jira] [Commented] (SPARK-21908) Checkpoint broadcast variable in spark streaming job
[ https://issues.apache.org/jira/browse/SPARK-21908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152702#comment-16152702 ] Venkat Gurukrishna commented on SPARK-21908: [~srowen] I tried sending mail from id to u...@spark.apache.org but I got the following error: : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. Can you let me know how to send an email and to what email id I should send? > Checkpoint broadcast variable in spark streaming job > > > Key: SPARK-21908 > URL: https://issues.apache.org/jira/browse/SPARK-21908 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, we are using the broadcast > variables and when we checkpoint them and restart the spark job getting error. > Even tried with making the broadcast variable as transient. But we are > getting different exception. > I have checked this JIRA link: > https://issues.apache.org/jira/browse/SPARK-5206 > which had mentioned to use singleton reference to broadcast variable and also > to use the transient. > Whether this needs to be done in the driver side or at the executor side? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21910) Connection pooling in Spark Job using HBASE Context
[ https://issues.apache.org/jira/browse/SPARK-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152699#comment-16152699 ] Venkat Gurukrishna commented on SPARK-21910: [~srowen] I tried sending mail from id to u...@spark.apache.org but I got the following error: : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. Can you let me know how to send an email and to what email id I should send? > Connection pooling in Spark Job using HBASE Context > --- > > Key: SPARK-21910 > URL: https://issues.apache.org/jira/browse/SPARK-21910 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > Is there a way to implement the HBASE connection pool using the HBASE > Context? In our spark job we are making the HBASE call for each batch and we > see new connection object is getting created for each batch interval of one > second. We want to implement the connection pooling for HBASE context. Not > able to do the same. Is there way to achieve the same the connection pool to > HBASE using HBASE Context. We are using Spark 1.6.0, CDH 5.8.3, HBASE 1.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21909) Checkpoint HBASE Context in Spark Streaming Job
[ https://issues.apache.org/jira/browse/SPARK-21909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152700#comment-16152700 ] Venkat Gurukrishna commented on SPARK-21909: [~srowen] I tried sending mail from id to u...@spark.apache.org but I got the following error: : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. Can you let me know how to send an email and to what email id I should send? > Checkpoint HBASE Context in Spark Streaming Job > --- > > Key: SPARK-21909 > URL: https://issues.apache.org/jira/browse/SPARK-21909 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, when using the HBaseContext with > checkpoint and restart, it is giving exception. How to handle the > checkpointing for HBaseContext? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21910) Connection pooling in Spark Job using HBASE Context
[ https://issues.apache.org/jira/browse/SPARK-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21910. --- Resolution: Invalid Please stop opening JIRAs with questions. Use the mailing list > Connection pooling in Spark Job using HBASE Context > --- > > Key: SPARK-21910 > URL: https://issues.apache.org/jira/browse/SPARK-21910 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > Is there a way to implement the HBASE connection pool using the HBASE > Context? In our spark job we are making the HBASE call for each batch and we > see new connection object is getting created for each batch interval of one > second. We want to implement the connection pooling for HBASE context. Not > able to do the same. Is there way to achieve the same the connection pool to > HBASE using HBASE Context. We are using Spark 1.6.0, CDH 5.8.3, HBASE 1.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21909) Checkpoint HBASE Context in Spark Streaming Job
[ https://issues.apache.org/jira/browse/SPARK-21909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21909. --- Resolution: Invalid Please stop opening JIRAs with questions. Use the mailing list > Checkpoint HBASE Context in Spark Streaming Job > --- > > Key: SPARK-21909 > URL: https://issues.apache.org/jira/browse/SPARK-21909 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, when using the HBaseContext with > checkpoint and restart, it is giving exception. How to handle the > checkpointing for HBaseContext? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21910) Connection pooling in Spark Job using HBASE Context
Venkat Gurukrishna created SPARK-21910: -- Summary: Connection pooling in Spark Job using HBASE Context Key: SPARK-21910 URL: https://issues.apache.org/jira/browse/SPARK-21910 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Venkat Gurukrishna Is there a way to implement the HBASE connection pool using the HBASE Context? In our spark job we are making the HBASE call for each batch and we see new connection object is getting created for each batch interval of one second. We want to implement the connection pooling for HBASE context. Not able to do the same. Is there way to achieve the same the connection pool to HBASE using HBASE Context. We are using Spark 1.6.0, CDH 5.8.3, HBASE 1.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
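A common way to get connection reuse without an explicit pool is one shared connection per executor JVM. The sketch below uses the plain HBase 1.2 client API rather than HBaseContext, so it is an assumed alternative, not a confirmed HBaseContext feature; table and key handling are placeholders.

{code:java}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HBaseConnectionHolder {
  // Created at most once per executor JVM and reused across batches.
  lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
}

// Usage inside the streaming job, e.g.:
// dstream.foreachRDD { rdd =>
//   rdd.foreachPartition { keys =>
//     val table = HBaseConnectionHolder.connection.getTable(TableName.valueOf("my_table"))
//     keys.foreach { key =>
//       val result = table.get(new Get(Bytes.toBytes(key)))
//       // ... use result ...
//     }
//     table.close() // close the lightweight Table, keep the shared Connection open
//   }
// }
{code}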
[jira] [Resolved] (SPARK-21908) Checkpoint broadcast variable in spark streaming job
[ https://issues.apache.org/jira/browse/SPARK-21908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21908. --- Resolution: Invalid It's not clear what your error is or what the result was of using a singleton, but, questions should go to the mailing list in any event. > Checkpoint broadcast variable in spark streaming job > > > Key: SPARK-21908 > URL: https://issues.apache.org/jira/browse/SPARK-21908 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, we are using the broadcast > variables and when we checkpoint them and restart the spark job getting error. > Even tried with making the broadcast variable as transient. But we are > getting different exception. > I have checked this JIRA link: > https://issues.apache.org/jira/browse/SPARK-5206 > which had mentioned to use singleton reference to broadcast variable and also > to use the transient. > Whether this needs to be done in the driver side or at the executor side? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21909) Checkpoint HBASE Context in Spark Streaming Job
Venkat Gurukrishna created SPARK-21909: -- Summary: Checkpoint HBASE Context in Spark Streaming Job Key: SPARK-21909 URL: https://issues.apache.org/jira/browse/SPARK-21909 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Venkat Gurukrishna In our Spark 1.6 CDH 5.8.3, job application, when using the HBaseContext with checkpoint and restart, it is giving exception. How to handle the checkpointing for HBaseContext? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21908) Checkpoint broadcast variable in spark streaming job
Venkat Gurukrishna created SPARK-21908: -- Summary: Checkpoint broadcast variable in spark streaming job Key: SPARK-21908 URL: https://issues.apache.org/jira/browse/SPARK-21908 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Venkat Gurukrishna In our Spark 1.6 CDH 5.8.3, job application, we are using the broadcast variables and when we checkpoint them and restart the spark job getting error. Even tried with making the broadcast variable as transient. But we are getting different exception. I have checked this JIRA link: https://issues.apache.org/jira/browse/SPARK-5206 which had mentioned to use singleton reference to broadcast variable and also to use the transient. Whether this needs to be done in the driver side or at the executor side? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
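For reference, the lazily instantiated singleton referred to in SPARK-5206 usually looks like the sketch below (adapted from the Spark Streaming programming guide pattern, with placeholder contents). It is driver-side code: the singleton is consulted inside {{foreachRDD}} or {{transform}}, so after a restart from checkpoint the broadcast is rebuilt on first use instead of being read back from the checkpoint.

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object WordBlacklist {
  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c"))  // placeholder contents
        }
      }
    }
    instance
  }
}

// Driver-side usage, executed per batch:
// dstream.foreachRDD { rdd =>
//   val blacklist = WordBlacklist.getInstance(rdd.sparkContext)
//   rdd.filter(word => !blacklist.value.contains(word)).count()
// }
{code}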
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152639#comment-16152639 ] Yanbo Liang commented on SPARK-21727: - [~felixcheung] What do you mean for this comment? {quote} But with that said, I think we could and should make a minor change to support that implicitly https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L39 {quote} How we can get the SerDe type of atomic vector? Just like I mentioned above, {code} > class(rep(0, 20)) [1] "numeric" > class(as.list(rep(0, 20))) [1] "list" {code} _class_ function can't return type _vector_, how we can determine the type of object is _vector_ or _numeric_ ? Thanks. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table
[ https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bluejoe updated SPARK-21905: Description: {code:java} val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new PointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person"); sqlContext.sql("SELECT * FROM person").foreach(println(_)); {code} the last statement throws exception: {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 18 more {code} was: {code:java} val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new PointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person"); sqlContext.sql("SELECT * FROM person").foreach(println(_)); {code} the last statement throws exception: {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 18 more {code} > ClassCastException when call sqlContext.sql on temp table > - > > Key: SPARK-21905 > URL: https://issues.apache.org/jira/browse/SPARK-21905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: bluejoe > > {code:java} > val schema = StructType(List( > StructField("name", DataTypes.StringType, true), > StructField("location", new PointUDT, true))) > val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), > 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); > val dataFrame = sqlContext.createDataFrame(rowRdd, schema) > dataFrame.createOrReplaceTempView("person"); > sqlContext.sql("SELECT * FROM person").foreach(println(_)); > {code} > the last statement throws exception: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) > ... 
18 more > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
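The {{GenericRow cannot be cast to InternalRow}} message usually means a UDT's {{serialize}} produced an external {{Row}} where Catalyst expects its internal representation. Below is a sketch of a {{PointUDT}} along those lines; the {{Point}} class is assumed from the report, and since {{UserDefinedType}} is not a public API in Spark 2.x, a real implementation would have to live under an {{org.apache.spark}} package.

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

case class Point(x: Double, y: Double)

class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = StructType(Seq(
    StructField("x", DoubleType, nullable = false),
    StructField("y", DoubleType, nullable = false)))

  // Emit the internal representation matching sqlType, not Row.fromSeq(...).
  override def serialize(p: Point): InternalRow = InternalRow(p.x, p.y)

  override def deserialize(datum: Any): Point = datum match {
    case row: InternalRow => Point(row.getDouble(0), row.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
{code}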
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152409#comment-16152409 ] jincheng commented on SPARK-18085: -- *{color:red}Here is a picture of how it looks{color}* !screenshot-1.png! {color:red}*and I also tried in spark 2.0.it looks like this *{color} !screenshot-2.png! the code is located at : org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:108) and it calls pagetable.pageData(PagedTable.scala:56) and throws an exception {code:java} def pageData(page: Int): PageData[T] = { val totalPages = (dataSize + pageSize - 1) / pageSize if (page <= 0 || page > totalPages) { throw new IndexOutOfBoundsException( s"Page $page is out of range. Please select a page number between 1 and $totalPages.") } val from = (page - 1) * pageSize val to = dataSize.min(page * pageSize) PageData(totalPages, sliceData(from, to)) } {code} it looks page=1 but totalPages = 0. so datasize + pagesize = 1. as {code:java} private[ui] abstract class PagedDataSource[T](val pageSize: Int) { if (pageSize <= 0) { throw new IllegalArgumentException("Page size must be positive") } {code} we did not meet this exception. so datasize = 0. this matches the case that no completed tasks, but instead all failed tasks should displayed just like spark 2.0. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
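To make the arithmetic in the comment above explicit: with no completed tasks the data source reports {{dataSize = 0}}, so {{totalPages}} rounds down to 0 and the default request for page 1 is out of range. A worked example (the page size value is illustrative):

{code:java}
val pageSize = 100            // must be > 0, otherwise the constructor throws
val dataSize = 0              // no completed tasks in the stage
val totalPages = (dataSize + pageSize - 1) / pageSize   // (0 + 99) / 100 == 0
val page = 1                  // the default page requested by the UI
assert(page > totalPages)     // 1 > 0, so pageData throws IndexOutOfBoundsException
{code}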
[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152408#comment-16152408 ] Juliusz Sompolski commented on SPARK-21907: --- Note that UnsafeExternalSorter.spill appears twice on the stack trace, so it's nested spilling: the first triggered spilling triggers another spilling through UnsafeInMemorySorter.reset. Possibly it's messing up something by nested-spilling itself twice? Or messing something with {code:java} if (trigger != this) { if (readingIterator != null) { return readingIterator.spill(); } return 0L; // this should throw exception } {code} in spill() > NullPointerException in UnsafeExternalSorter.spill() > > > Key: SPARK-21907 > URL: https://issues.apache.org/jira/browse/SPARK-21907 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > > I see NPE during sorting with the following stacktrace: > {code} > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125
[jira] [Created] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
Juliusz Sompolski created SPARK-21907: - Summary: NullPointerException in UnsafeExternalSorter.spill() Key: SPARK-21907 URL: https://issues.apache.org/jira/browse/SPARK-21907 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Juliusz Sompolski I see NPE during sorting with the following stacktrace: {code} java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mai
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21866: -- Shepherd: Joseph K. Bradley > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about the origin of the image. The content of this
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15689: -- Shepherd: Reynold Xin Affects Version/s: 2.3.0 > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin > Labels: SPIP, releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
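To make goals 1 and 2 above more concrete, below is a purely hypothetical sketch of what a small-surface read contract could look like. None of these trait or method names are the actual v2 API; they are placeholders invented for illustration only.

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical: a reader that depends only on low-level types, not on
// DataFrame/SQLContext, keeping the surface small and stable.
trait HypotheticalBatchReader {
  def schema: StructType
  def pushFilters(filters: Seq[Filter]): Seq[Filter]   // returns the filters it cannot handle
  def planPartitions(): Seq[HypotheticalPartitionReader]
}

trait HypotheticalPartitionReader {
  def next(): Boolean
  def get(): InternalRow        // or a column batch on the vectorized path
  def close(): Unit
}
{code}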
[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18085: -- Shepherd: Marcelo Vanzin > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jincheng updated SPARK-18085: - Attachment: screenshot-2.png > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21906: Assignee: Apache Spark > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao >Assignee: Apache Spark > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21906: Assignee: (was: Apache Spark) > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jincheng updated SPARK-18085: - Attachment: screenshot-1.png > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152295#comment-16152295 ] Apache Spark commented on SPARK-21906: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/19121 > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21900) Numerical Error in simple Skewness Computation
[ https://issues.apache.org/jira/browse/SPARK-21900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152284#comment-16152284 ] Sean Owen commented on SPARK-21900: --- I don't feel strongly about it, but this is a reasonable issue to report. Especially since it didn't seem like it acted this way in 2.2. I don't have a suggested change but would be open to a patch for this if someone finds a method to compute the higher-order moments more accurately without sacrificing (much) speed. > Numerical Error in simple Skewness Computation > -- > > Key: SPARK-21900 > URL: https://issues.apache.org/jira/browse/SPARK-21900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakob Bach >Priority: Minor > > The skewness() aggregate SQL function in the Scala implementation > (org.apache.spark.sql.functions.skewness) seems to be buggy. The following code > {code:java} > import org.apache.spark.sql.functions > import org.apache.spark.sql.SparkSession > object SkewTest { > def main(args: Array[String]): Unit = { > val spark = SparkSession. > builder(). > appName("Skewness example"). > master("local[1]"). > getOrCreate() > > spark.createDataFrame(Seq(4,1,2,3).map(Tuple1(_))).agg(functions.skewness("_1")).show() > } > } > {code} > should output 0 (as it does for Seq(1,2,3,4)), but outputs > {code:none} > ++ > |skewness(_1)| > ++ > |5.958081967793454...| > ++ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
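To make the suggestion above concrete, a two-pass computation of the central moments avoids most of the cancellation error that a one-pass raw-moment formula suffers from. The sketch below is plain local Scala, not Spark's aggregate implementation; the helper name and the use of an in-memory Seq[Double] are assumptions for illustration only.
{code:java}
// Hedged sketch: two-pass (population) skewness over an in-memory collection.
// Pass 1 computes the mean; pass 2 computes the central moments directly,
// which is far less prone to catastrophic cancellation than combining raw moments.
def skewnessTwoPass(xs: Seq[Double]): Double = {
  require(xs.nonEmpty, "need at least one value")
  val n = xs.length.toDouble
  val mean = xs.sum / n
  val m2 = xs.map(x => (x - mean) * (x - mean)).sum / n // 2nd central moment
  val m3 = xs.map(x => math.pow(x - mean, 3)).sum / n   // 3rd central moment
  m3 / math.pow(m2, 1.5)
}

// skewnessTwoPass(Seq(4.0, 1.0, 2.0, 3.0)) evaluates to exactly 0.0 for this
// symmetric sample, the result the reporter expected.
{code}
Whether a formulation like this can be folded into Spark's distributed, single-pass aggregation without losing speed is exactly the open question in the comment above.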
[jira] [Resolved] (SPARK-21892) status code is 200 OK when killing an application fails via the Spark master REST API
[ https://issues.apache.org/jira/browse/SPARK-21892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21892. --- Resolution: Not A Problem > status code is 200 OK when killing an application fails via the Spark master REST API > --- > > Key: SPARK-21892 > URL: https://issues.apache.org/jira/browse/SPARK-21892 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Zhuang Xueyin >Priority: Minor > > Sent a POST request to the Spark master REST API, e.g.: > http://:6066/v1/submissions/kill/driver-xxx > Request body: > { > "action" : "KillSubmissionRequest", > "clientSparkVersion" : "2.1.0" > } > Response body: > { > "action" : "KillSubmissionResponse", > "message" : "Driver driver-xxx has already finished or does not exist", > "serverSparkVersion" : "2.1.0", > "submissionId" : "driver-xxx", > "success" : false > } > Response headers: > *Status Code: 200 OK* > Content-Length: 203 > Content-Type: application/json; charset=UTF-8 > Date: Fri, 01 Sep 2017 05:56:04 GMT > Server: Jetty(9.2.z-SNAPSHOT) > Result: > the status code is 200 OK even when killing an application fails via the Spark master REST API. > While the response body indicates that the operation was not successful, returning a success status > code does not follow REST API conventions, so I suggest improving it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
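For anyone hitting this before the behaviour changes, the practical consequence is that a client cannot trust the HTTP status code alone. The sketch below shows the kind of client-side check currently required; the host, port and driver id are placeholders, and the substring check stands in for real JSON parsing.
{code:java}
// Hedged sketch: submit the kill request and decide success from the JSON body,
// since the endpoint returns HTTP 200 even when the kill did not succeed.
import java.net.{HttpURLConnection, URL}
import scala.io.Source

val url = new URL("http://master-host:6066/v1/submissions/kill/driver-xxx") // placeholders
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.setRequestProperty("Content-Type", "application/json")
conn.getOutputStream.write(
  """{"action" : "KillSubmissionRequest", "clientSparkVersion" : "2.1.0"}""".getBytes("UTF-8"))

val status = conn.getResponseCode // 200 whether or not the driver was killed
val body = Source.fromInputStream(conn.getInputStream).mkString
val killed = body.contains("\"success\" : true") // the field that actually matters
println(s"HTTP $status, killed = $killed")
{code}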
[jira] [Updated] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-21906: - Description: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set {code:java} env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} was: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set bq. env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-21906: - Description: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set bq. env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} was: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set bq. env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() > in the am container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-21906: - Description: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} was: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the am container context {code|java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set ` env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() > ` in the am container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
Kent Yao created SPARK-21906: Summary: No need to runAsSparkUser to switch UserGroupInformation in YARN mode Key: SPARK-21906 URL: https://issues.apache.org/jira/browse/SPARK-21906 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 2.2.0 Reporter: Kent Yao 1. The YARN application's ugi is determined by the ugi that launched it. 2. runAsSparkUser only switches to a ugi identical to the current one, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the AM container context: {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the current user logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi for the same user transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itself def run: Unit = func() }) } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
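A minimal sketch of what the proposed simplification could look like: when the current UGI already corresponds to SPARK_USER (the YARN case described in point 1 above), the doAs round-trip can be skipped and the function invoked directly. The guard and the fallback branch are assumptions for illustration, not the actual patch.
{code:java}
// Hedged sketch, not the real change: skip the createRemoteUser/doAs dance when the
// process is already running as the intended Spark user.
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runAsSparkUser(func: () => Unit): Unit = {
  val current = UserGroupInformation.getCurrentUser
  val sparkUser = sys.env.getOrElse("SPARK_USER", current.getShortUserName)
  if (current.getShortUserName == sparkUser) {
    func() // already the right user; no UGI switch needed
  } else {
    // a real change would also need to transfer credentials here, as the current code does
    val ugi = UserGroupInformation.createRemoteUser(sparkUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = func()
    })
  }
}
{code}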
[jira] [Comment Edited] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144943#comment-16144943 ] Adrien Lavoillotte edited comment on SPARK-21850 at 9/4/17 8:41 AM: I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but works in -every- other database/SQL engine that we tested, including hive and earlier versions of SparkSQL. was (Author: instanceof me): I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but -works in every other database/SQL engine that we tested-, including hive and earlier versions of SparkSQL. > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). 
If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
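For reference, the escaping the reporter expected Spark to apply automatically can be done by hand before the value is used as a pattern. This is a plain-Scala sketch under the assumption that \, _ and % are the only LIKE metacharacters; the helper name is made up for illustration.
{code:java}
// Hedged sketch: escape LIKE metacharacters in a value that will be used as a pattern,
// so a literal backslash in the data is not read as the start of an escape sequence.
def escapeForLike(value: String): String =
  value.flatMap {
    case '\\' => "\\\\"  // literal backslash  -> escaped backslash
    case '%'  => "\\%"   // literal percent    -> escaped percent
    case '_'  => "\\_"   // literal underscore -> escaped underscore
    case c    => c.toString
  }

// The problematic value from row 3 (a real backslash followed by 'n') comes back with
// its backslash doubled, so a LIKE pattern built from it is no longer rejected.
val pattern = escapeForLike("""there is a \n in this line""") + "%"
{code}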
[jira] [Created] (SPARK-21905) ClassCastException when calling sqlContext.sql on a temp table
bluejoe created SPARK-21905: --- Summary: ClassCastException when calling sqlContext.sql on a temp table Key: SPARK-21905 URL: https://issues.apache.org/jira/browse/SPARK-21905 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: bluejoe {code:java} val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new PointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person"); sqlContext.sql("SELECT * FROM person").foreach(println(_)); {code} the last statement throws an exception: {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 18 more {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
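One way to narrow this down is to run the same pipeline with only built-in types; this is a diagnostic sketch, assuming a SparkSession named spark, with the reporter's PointUDT and Point deliberately left out. If the variant below succeeds, the GenericRow-to-InternalRow cast failure is most likely tied to how rows carrying the custom UDT are converted, rather than to the temp view or the SQL path itself.
{code:java}
// Hedged diagnostic sketch: identical flow, built-in types only.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("name", StringType, nullable = true),
  StructField("x", DoubleType, nullable = true)))

val rowRdd = spark.sparkContext
  .parallelize(Seq("bluejoe", "alex"), 4)
  .map(name => Row(name, 100.0))

spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("person")
spark.sql("SELECT * FROM person").collect().foreach(println) // expected to succeed
{code}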
[jira] [Closed] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Lavoillotte closed SPARK-21850. -- Resolution: Not A Bug > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152241#comment-16152241 ] Adrien Lavoillotte commented on SPARK-21850: The behaviour seems indeed logical if it takes the actual value without escaping it, and I actually replicated it in some other DBs (our earlier tests were wrong, each DB having its own rules for escaping \), although they just don't match instead of failing, which is arguably preferable. I'll close the issue, thank you for your help! > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by 
SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144943#comment-16144943 ] Adrien Lavoillotte edited comment on SPARK-21850 at 9/4/17 8:07 AM: I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but -works in every other database/SQL engine that we tested-, including hive and earlier versions of SparkSQL. was (Author: instanceof me): I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but works in every other database/SQL engine that we tested, including hive and earlier versions of SparkSQL. > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). 
If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org