[jira] [Assigned] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20090:


Assignee: Apache Spark

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.






[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085224#comment-16085224
 ] 

Apache Spark commented on SPARK-20090:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18618
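
As context, a minimal sketch of the kind of API parity being requested, assuming a fieldNames() method that mirrors the Scala API; the actual pull request may expose it differently. Today the same information has to be derived from {{StructType.fields}}:

{code:python}
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Existing workaround: derive the field names from StructType.fields.
names_today = [f.name for f in schema.fields]   # ['eid', 'name']

# Hypothetical helper mirroring Scala's StructType.fieldNames; the shape of
# the real addition in the linked PR may differ.
def field_names(struct_type):
    return [f.name for f in struct_type.fields]

assert field_names(schema) == names_today
{code}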

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.






[jira] [Assigned] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20090:


Assignee: (was: Apache Spark)

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.






[jira] [Created] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2017-07-12 Thread Haopu Wang (JIRA)
Haopu Wang created SPARK-21396:
--

 Summary: Spark Hive Thriftserver doesn't return UDT field
 Key: SPARK-21396
 URL: https://issues.apache.org/jira/browse/SPARK-21396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Haopu Wang


I want to query a table with an MLlib Vector field and get the exception below.
Can the Spark Hive Thriftserver be enhanced to return UDT fields?

==
2017-07-13 13:14:25,435 WARN  
[org.apache.hive.service.cli.thrift.ThriftCLIService] 
(HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
java.lang.RuntimeException: scala.MatchError: 
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
org.apache.spark.ml.linalg.VectorUDT)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of 
class org.apache.spark.ml.linalg.VectorUDT)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144)
at 
org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
... 18 more
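
As a hedged workaround sketch (not a fix for the thrift server itself): convert the UDT column into a plain SQL type before exposing the table, for example by turning the ml Vector into an array of doubles with a UDF. The table and column names below are illustrative assumptions, not taken from the report.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("udt-workaround").getOrCreate()

# Illustrative data: a table with an ml VectorUDT column.
df = spark.createDataFrame(
    [(1, Vectors.dense([1.0, 2.0])), (2, Vectors.dense([3.0, 4.0]))],
    ["id", "features"])

# Convert the UDT column to a plain array<double> so clients that cannot
# handle VectorUDT (such as the thrift server here) can still read the data.
to_array = udf(lambda v: v.toArray().tolist() if v is not None else None,
               ArrayType(DoubleType()))

df.withColumn("features", to_array("features")) \
  .createOrReplaceTempView("vector_table_plain")
{code}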







[jira] [Comment Edited] (SPARK-20703) Add an operator for writing data out

2017-07-12 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085017#comment-16085017
 ] 

Liang-Chi Hsieh edited comment on SPARK-20703 at 7/13/17 5:28 AM:
--

Thanks [~ste...@apache.org] for voicing this.

Regarding the latency on the object store, I am not sure which committer implementation 
you use for the object store (S3?). I take the following S3 committer 
as an example:

https://github.com/rdblue/s3committer/blob/44d41d475488edf60f5dbbe2224c0ef9227e55dc/src/main/java/com/netflix/bdp/s3/S3MultipartOutputCommitter.java#L213

The temp file for a task in Spark's FileCommitProtocol would use the S3 committer's 
work path as the staging path, which is a path on the local FS.

So I assume the latency would only show up if the staging path is not 
based on a local path?

For the FNFE issue, the call to getFileSize currently happens after the 
current OutputWriter is closed. I agree that we should catch all IOEs in 
getFileSize, so a patch is welcome, but I am also curious about the cases where 
the file is not there: do we get another chance to materialize the file later? 
Otherwise, should the task's commit fail later if the temp file is 
not there?











was (Author: viirya):
Thanks [~ste...@apache.org] for voicing this.

Regarding the latency on the object store, I am not sure which committer implementation 
you use for the object store (S3?). I take the following S3 committer 
as an example:

https://github.com/rdblue/s3committer/blob/44d41d475488edf60f5dbbe2224c0ef9227e55dc/src/main/java/com/netflix/bdp/s3/S3MultipartOutputCommitter.java#L213

The temp file for a task in Spark's FileCommitProtocol would use the S3 committer's 
work path as the staging path, which is a path on the local FS.

So I assume the latency would only show up if the staging path is not 
based on a local path?

For the FNFE issue, the call to getFileSize currently happens after the 
current OutputWriter is closed. I agree that we should catch all IOEs in 
getFileSize, so a patch is welcome, but I am also curious about the cases where 
the file is not there: do we get another chance to materialize the file later?










> Add an operator for writing data out
> 
>
> Key: SPARK-20703
> URL: https://issues.apache.org/jira/browse/SPARK-20703
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add an operator for writing data out. Right now in the explain plan 
> / UI there is no way to tell whether a query is writing data out, and also 
> there is no way to associate metrics with data writes. It'd be tremendously 
> valuable to do this for adding metrics and for visibility.






[jira] [Commented] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085160#comment-16085160
 ] 

Apache Spark commented on SPARK-21376:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18617

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run the HDFSWordcount streaming app in yarn-cluster mode for 25 hrs. 
> After 25 hours, notice that the HDFS Wordcount job hits an 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}






[jira] [Assigned] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21376:


Assignee: Apache Spark

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Assignee: Apache Spark
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run the HDFSWordcount streaming app in yarn-cluster mode for 25 hrs. 
> After 25 hours, notice that the HDFS Wordcount job hits an 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}






[jira] [Assigned] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21376:


Assignee: (was: Apache Spark)

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run the HDFSWordcount streaming app in yarn-cluster mode for 25 hrs. 
> After 25 hours, notice that the HDFS Wordcount job hits an 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}






[jira] [Comment Edited] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085099#comment-16085099
 ] 

Kazuaki Ishizaki edited comment on SPARK-21391 at 7/13/17 3:42 AM:
---

[~hyukjin.kwon] I think that SPARK-19254 and/or SPARK-19104 fixed this issue.


was (Author: kiszk):
[~hyukjin.kwon] I think that 
[SPARK-19254|https://issues.apache.org/jira/browse/SPARK-19254] and/or 
[SPARK-19104|https://issues.apache.org/jira/browse/SPARK-19104] fixed this 
issue.

> Cannot convert a Seq of Map whose value type is again a seq, into a dataset 
> 
>
> Key: SPARK-21391
> URL: https://issues.apache.org/jira/browse/SPARK-21391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Seen on mac OSX, scala 2.11, java 8
>Reporter: indraneel rao
>
> There is an error when trying to create a dataset from a sequence of Maps 
> whose values contain any kind of collection, even when they are wrapped in a 
> case class. 
> E.g., the following piece of code throws an error:
>
> {code:java}
> case class Values(values: Seq[Double])
> case class ItemProperties(properties:Map[String,Values])
> val l1 = List(ItemProperties(
>   Map(
> "A1" -> Values(Seq(1.0,2.0)),
> "B1" -> Values(Seq(44.0,55.0))
>   )
> ),
>   ItemProperties(
> Map(
>   "A2" -> Values(Seq(123.0,25.0)),
>   "B2" -> Values(Seq(445.0,35.0))
> )
>   )
> )
> l1.toDS().show()
> {code}
> Here's the error:
> 17/07/12 21:59:35 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 65, Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an 
> rvalue
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private boolean resultIsNull;
> /* 009 */   private java.lang.String argValue;
> /* 010 */   private Object[] values;
> /* 011 */   private boolean resultIsNull1;
> /* 012 */   private scala.collection.Seq argValue1;
> /* 013 */   private boolean isNull11;
> /* 014 */   private boolean value11;
> /* 015 */   private boolean isNull12;
> /* 016 */   private InternalRow value12;
> /* 017 */   private boolean isNull13;
> /* 018 */   private InternalRow value13;
> /* 019 */   private UnsafeRow result;
> /* 020 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter;
> /* 023 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter1;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter2;
> /* 026 */
> /* 027 */   public SpecificUnsafeProjection(Object[] references) {
> /* 028 */ this.references = references;
> /* 029 */
> /* 030 */
> /* 031 */ this.values = null;
> /* 032 */
> /* 033 */
> /* 034 */ isNull11 = false;
> /* 035 */ value11 = false;
> /* 036 */ isNull12 = false;
> /* 037 */ value12 = null;
> /* 038 */ isNull13 = false;
> /* 039 */ value13 = null;
> /* 040 */ result = new UnsafeRow(1);
> /* 041 */ this.holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
> /* 042 */ this.rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 043 */ this.arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 044 */ this.arrayWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 045 */ this.rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 046 */ this.arrayWriter2 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 047 */
> /* 048 */   }
> /* 049 */
> /* 050 */   public void initialize(int partitionIndex) {
> /* 051 */
> /* 052 */   }
> /* 053 */
> /* 054 */
> /* 055 */   private void evalIfTrueExpr(InternalRow i) {
> /* 056 */ final InternalRow value7 = null;
> /* 057 */ isNull12 = true;
> /* 058 */ value12 = value7;
> /* 059 */   }
> /* 060 */
> /* 061 */
> /* 062 */   

[jira] [Commented] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085099#comment-16085099
 ] 

Kazuaki Ishizaki commented on SPARK-21391:
--

[~hyukjin.kwon] I think that 
[SPARK-19254|https://issues.apache.org/jira/browse/SPARK-19254] and/or 
[SPARK-19104|https://issues.apache.org/jira/browse/SPARK-19104] fixed this 
issue.

> Cannot convert a Seq of Map whose value type is again a seq, into a dataset 
> 
>
> Key: SPARK-21391
> URL: https://issues.apache.org/jira/browse/SPARK-21391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Seen on mac OSX, scala 2.11, java 8
>Reporter: indraneel rao
>
> There is an error when trying to create a dataset from a sequence of Maps 
> whose values contain any kind of collection, even when they are wrapped in a 
> case class. 
> E.g., the following piece of code throws an error:
>
> {code:java}
> case class Values(values: Seq[Double])
> case class ItemProperties(properties:Map[String,Values])
> val l1 = List(ItemProperties(
>   Map(
> "A1" -> Values(Seq(1.0,2.0)),
> "B1" -> Values(Seq(44.0,55.0))
>   )
> ),
>   ItemProperties(
> Map(
>   "A2" -> Values(Seq(123.0,25.0)),
>   "B2" -> Values(Seq(445.0,35.0))
> )
>   )
> )
> l1.toDS().show()
> {code}
> Here's the error:
> 17/07/12 21:59:35 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 65, Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an 
> rvalue
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private boolean resultIsNull;
> /* 009 */   private java.lang.String argValue;
> /* 010 */   private Object[] values;
> /* 011 */   private boolean resultIsNull1;
> /* 012 */   private scala.collection.Seq argValue1;
> /* 013 */   private boolean isNull11;
> /* 014 */   private boolean value11;
> /* 015 */   private boolean isNull12;
> /* 016 */   private InternalRow value12;
> /* 017 */   private boolean isNull13;
> /* 018 */   private InternalRow value13;
> /* 019 */   private UnsafeRow result;
> /* 020 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter;
> /* 023 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter1;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter2;
> /* 026 */
> /* 027 */   public SpecificUnsafeProjection(Object[] references) {
> /* 028 */ this.references = references;
> /* 029 */
> /* 030 */
> /* 031 */ this.values = null;
> /* 032 */
> /* 033 */
> /* 034 */ isNull11 = false;
> /* 035 */ value11 = false;
> /* 036 */ isNull12 = false;
> /* 037 */ value12 = null;
> /* 038 */ isNull13 = false;
> /* 039 */ value13 = null;
> /* 040 */ result = new UnsafeRow(1);
> /* 041 */ this.holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
> /* 042 */ this.rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 043 */ this.arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 044 */ this.arrayWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 045 */ this.rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 046 */ this.arrayWriter2 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 047 */
> /* 048 */   }
> /* 049 */
> /* 050 */   public void initialize(int partitionIndex) {
> /* 051 */
> /* 052 */   }
> /* 053 */
> /* 054 */
> /* 055 */   private void evalIfTrueExpr(InternalRow i) {
> /* 056 */ final InternalRow value7 = null;
> /* 057 */ isNull12 = true;
> /* 058 */ value12 = value7;
> /* 059 */   }
> /* 060 */
> /* 061 */
> /* 062 */   private void evalIfCondExpr(InternalRow i) {
> /* 063 */
> /* 064 */ isNull11 = false;
> /* 065 */ value11 = ExternalMapToCatalyst_value_isNull0;
> 

[jira] [Updated] (SPARK-21395) Spark SQL hive-thriftserver doesn't register operation log before execute sql statement

2017-07-12 Thread Chaozhong Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaozhong Yang updated SPARK-21395:
---
Description: 
In HiveServer2, TFetchResultsReq has a member named `fetchType`. If 
fetchType is equal to `1`, the thrift server should return the operation log to 
the client. However, we found that Spark SQL's thrift server always returns nothing 
to the client for a TFetchResultsReq with fetchType(1). We 
checked the ${HIVE_SERVER2_LOGGING_OPERATION_LOG_LOCATION}/${session-id} 
directory carefully and found that the operation log files existed but were 
zero bytes (empty files). Why? Let's take a look at SQLOperation.java in Hive:


{code:java}
  @Override
  public void runInternal() throws HiveSQLException {
setState(OperationState.PENDING);
final HiveConf opConfig = getConfigForOperation();
prepare(opConfig);
if (!shouldRunAsync()) {
  runQuery(opConfig);
} else {
  // We'll pass ThreadLocals in the background thread from the foreground 
(handler) thread
  final SessionState parentSessionState = SessionState.get();
  // ThreadLocal Hive object needs to be set in background thread.
  // The metastore client in Hive is associated with right user.
  final Hive parentHive = getSessionHive();
  // Current UGI will get used by metastore when metsatore is in embedded 
mode
  // So this needs to get passed to the new background thread
  final UserGroupInformation currentUGI = getCurrentUGI(opConfig);
  // Runnable impl to call runInternal asynchronously,
  // from a different thread
  Runnable backgroundOperation = new Runnable() {
@Override
public void run() {
  PrivilegedExceptionAction doAsAction = new 
PrivilegedExceptionAction() {
@Override
public Object run() throws HiveSQLException {
  Hive.set(parentHive);
  SessionState.setCurrentSessionState(parentSessionState);
  // Set current OperationLog in this async thread for keeping on 
saving query log.
  registerCurrentOperationLog();
  try {
runQuery(opConfig);
  } catch (HiveSQLException e) {
setOperationException(e);
LOG.error("Error running hive query: ", e);
  } finally {
unregisterOperationLog();
  }
  return null;
}
  };

  try {
currentUGI.doAs(doAsAction);
  } catch (Exception e) {
setOperationException(new HiveSQLException(e));
LOG.error("Error running hive query as user : " + 
currentUGI.getShortUserName(), e);
  }
  finally {
/**
 * We'll cache the ThreadLocal RawStore object for this background 
thread for an orderly cleanup
 * when this thread is garbage collected later.
 * @see 
org.apache.hive.service.server.ThreadWithGarbageCleanup#finalize()
 */
if (ThreadWithGarbageCleanup.currentThread() instanceof 
ThreadWithGarbageCleanup) {
  ThreadWithGarbageCleanup currentThread =
  (ThreadWithGarbageCleanup) 
ThreadWithGarbageCleanup.currentThread();
  currentThread.cacheThreadLocalRawStore();
}
  }
}
  };
  try {
// This submit blocks if no background threads are available to run 
this operation
Future backgroundHandle =

getParentSession().getSessionManager().submitBackgroundOperation(backgroundOperation);
setBackgroundHandle(backgroundHandle);
  } catch (RejectedExecutionException rejected) {
setState(OperationState.ERROR);
throw new HiveSQLException("The background threadpool cannot accept" +
" new task for execution, please retry the operation", rejected);
  }
}
  }
{code}

Obviously, registerOperationLog is the key step that allows Hive to produce and 
return the operation log to the client.

But in Spark SQL, SparkExecuteStatementOperation doesn't call registerOperationLog 
before executing the SQL statement:

{code:scala}
  override def runInternal(): Unit = {
setState(OperationState.PENDING)
setHasResultSet(true) // avoid no resultset for async run

if (!runInBackground) {
  execute()
} else {
  val sparkServiceUGI = Utils.getUGI()

  // Runnable impl to call runInternal asynchronously,
  // from a different thread
  val backgroundOperation = new Runnable() {

override def run(): Unit = {
  val doAsAction = new PrivilegedExceptionAction[Unit]() {
override def run(): Unit = {
  try {
execute()
  } catch {
case e: HiveSQLException =>
  setOperationException(e)
  log.error("Error running hive query: ", e)
  }
  

[jira] [Created] (SPARK-21395) Spark SQL hive-thriftserver doesn't register operation log before execute sql statement

2017-07-12 Thread Chaozhong Yang (JIRA)
Chaozhong Yang created SPARK-21395:
--

 Summary: Spark SQL hive-thriftserver doesn't register operation 
log before execute sql statement
 Key: SPARK-21395
 URL: https://issues.apache.org/jira/browse/SPARK-21395
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.1.0
Reporter: Chaozhong Yang


In HiveServer2, TFetchResultsReq has a member named `fetchType`. If 
fetchType is equal to `1`, the thrift server should return the operation log to 
the client. However, we found that Spark SQL's thrift server always returns nothing 
to the client for a TFetchResultsReq with fetchType(1). We 
checked the ${HIVE_SERVER2_LOGGING_OPERATION_LOG_LOCATION}/${session-id} 
directory carefully and found that the operation log files existed but were 
zero bytes (empty files). Why? Let's take a look at SQLOperation.java in Hive:


{code:java}
  @Override
  public void runInternal() throws HiveSQLException {
setState(OperationState.PENDING);
final HiveConf opConfig = getConfigForOperation();
prepare(opConfig);
if (!shouldRunAsync()) {
  runQuery(opConfig);
} else {
  // We'll pass ThreadLocals in the background thread from the foreground 
(handler) thread
  final SessionState parentSessionState = SessionState.get();
  // ThreadLocal Hive object needs to be set in background thread.
  // The metastore client in Hive is associated with right user.
  final Hive parentHive = getSessionHive();
  // Current UGI will get used by metastore when metsatore is in embedded 
mode
  // So this needs to get passed to the new background thread
  final UserGroupInformation currentUGI = getCurrentUGI(opConfig);
  // Runnable impl to call runInternal asynchronously,
  // from a different thread
  Runnable backgroundOperation = new Runnable() {
@Override
public void run() {
  PrivilegedExceptionAction doAsAction = new 
PrivilegedExceptionAction() {
@Override
public Object run() throws HiveSQLException {
  Hive.set(parentHive);
  SessionState.setCurrentSessionState(parentSessionState);
  // Set current OperationLog in this async thread for keeping on 
saving query log.
  registerCurrentOperationLog();
  try {
runQuery(opConfig);
  } catch (HiveSQLException e) {
setOperationException(e);
LOG.error("Error running hive query: ", e);
  } finally {
unregisterOperationLog();
  }
  return null;
}
  };

  try {
currentUGI.doAs(doAsAction);
  } catch (Exception e) {
setOperationException(new HiveSQLException(e));
LOG.error("Error running hive query as user : " + 
currentUGI.getShortUserName(), e);
  }
  finally {
/**
 * We'll cache the ThreadLocal RawStore object for this background 
thread for an orderly cleanup
 * when this thread is garbage collected later.
 * @see 
org.apache.hive.service.server.ThreadWithGarbageCleanup#finalize()
 */
if (ThreadWithGarbageCleanup.currentThread() instanceof 
ThreadWithGarbageCleanup) {
  ThreadWithGarbageCleanup currentThread =
  (ThreadWithGarbageCleanup) 
ThreadWithGarbageCleanup.currentThread();
  currentThread.cacheThreadLocalRawStore();
}
  }
}
  };
  try {
// This submit blocks if no background threads are available to run 
this operation
Future backgroundHandle =

getParentSession().getSessionManager().submitBackgroundOperation(backgroundOperation);
setBackgroundHandle(backgroundHandle);
  } catch (RejectedExecutionException rejected) {
setState(OperationState.ERROR);
throw new HiveSQLException("The background threadpool cannot accept" +
" new task for execution, please retry the operation", rejected);
  }
}
  }
{code}

Obviously, registerOperationLog is the key step that allows Hive to produce and 
return the operation log to the client.

But in Spark SQL, SparkExecuteStatementOperation doesn't call registerOperationLog 
before executing the SQL statement:

{code:scala}
  override def runInternal(): Unit = {
setState(OperationState.PENDING)
setHasResultSet(true) // avoid no resultset for async run

if (!runInBackground) {
  execute()
} else {
  val sparkServiceUGI = Utils.getUGI()

  // Runnable impl to call runInternal asynchronously,
  // from a different thread
  val backgroundOperation = new Runnable() {

override def run(): Unit = {
  val doAsAction = new PrivilegedExceptionAction[Unit]() {
override def run(): Unit = {
 

[jira] [Updated] (SPARK-21297) Add count in 'JDBC/ODBC Server' page.

2017-07-12 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-21297:
---
Description: 1. Add counts to the 'Session Statistics' and 'SQL Statistics' tables on 
the 'JDBC/ODBC Server' page, so that the statistics can be read clearly.  (was: 
1. Add a State column to the 'Session Statistics' table and add counts on the 'JDBC/ODBC Server' 
page, to identify whether a session is online or offline when there is a 
large number of sessions.

2. Add counts to the 'Session Statistics' and 'SQL Statistics' tables on the 'JDBC/ODBC 
Server' page, so that the statistics can be read clearly.)

> Add count in 'JDBC/ODBC Server' page.
> -
>
> Key: SPARK-21297
> URL: https://issues.apache.org/jira/browse/SPARK-21297
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1. Add counts to the 'Session Statistics' and 'SQL Statistics' tables on the 
> 'JDBC/ODBC Server' page, so that the statistics can be read clearly.






[jira] [Commented] (SPARK-20703) Add an operator for writing data out

2017-07-12 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085017#comment-16085017
 ] 

Liang-Chi Hsieh commented on SPARK-20703:
-

Thanks [~ste...@apache.org] for voicing this.

Regarding the latency on the object store, I am not sure which committer implementation 
you use for the object store (S3?). I take the following S3 committer 
as an example:

https://github.com/rdblue/s3committer/blob/44d41d475488edf60f5dbbe2224c0ef9227e55dc/src/main/java/com/netflix/bdp/s3/S3MultipartOutputCommitter.java#L213

The temp file for a task in Spark's FileCommitProtocol would use the S3 committer's 
work path as the staging path, which is a path on the local FS.

So I assume the latency would only show up if the staging path is not 
based on a local path?

For the FNFE issue, the call to getFileSize currently happens after the 
current OutputWriter is closed. I agree that we should catch all IOEs in 
getFileSize, so a patch is welcome, but I am also curious about the cases where 
the file is not there: do we get another chance to materialize the file later?










> Add an operator for writing data out
> 
>
> Key: SPARK-20703
> URL: https://issues.apache.org/jira/browse/SPARK-20703
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add an operator for writing data out. Right now in the explain plan 
> / UI there is no way to tell whether a query is writing data out, and also 
> there is no way to associate metrics with data writes. It'd be tremendously 
> valuable to do this for adding metrics and for visibility.






[jira] [Updated] (SPARK-21297) Add count in 'JDBC/ODBC Server' page.

2017-07-12 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-21297:
---
Summary: Add count in 'JDBC/ODBC Server' page.  (was: Add State in 'Session 
Statistics' table and add count in 'JDBC/ODBC Server' page.)

> Add count in 'JDBC/ODBC Server' page.
> -
>
> Key: SPARK-21297
> URL: https://issues.apache.org/jira/browse/SPARK-21297
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1. Add a State column to the 'Session Statistics' table and add counts on the 
> 'JDBC/ODBC Server' page, to identify whether a session is online or offline 
> when there is a large number of sessions.
> 2. Add counts to the 'Session Statistics' and 'SQL Statistics' tables on the 
> 'JDBC/ODBC Server' page, so that the statistics can be read clearly.






[jira] [Resolved] (SPARK-18646) ExecutorClassLoader for spark-shell does not honor spark.executor.userClassPathFirst

2017-07-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18646.
-
   Resolution: Fixed
 Assignee: Min Shen
Fix Version/s: 2.3.0

> ExecutorClassLoader for spark-shell does not honor 
> spark.executor.userClassPathFirst
> 
>
> Key: SPARK-18646
> URL: https://issues.apache.org/jira/browse/SPARK-18646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.2
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.3.0
>
>
> When submitting a spark-shell application, the executor side classloader is 
> set to be {{ExecutorClassLoader}}.
> However, it appears that when {{ExecutorClassLoader}} is used, parameter 
> {{spark.executor.userClassPathFirst}} is not honored.
> It turns out that, since the {{ExecutorClassLoader}} class is defined as
> {noformat}
> class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: 
> ClassLoader,
> userClassPathFirst: Boolean) extends ClassLoader with Logging
> {noformat}
> its parent classloader is actually the system default classloader (due to 
> {{ClassLoader}} class's default constructor) rather than the "parent" 
> classloader specified in {{ExecutorClassLoader}}'s constructor.
> As a result, when {{spark.executor.userClassPathFirst}} is set to true, even 
> though the "parent" classloader is {{ChildFirstURLClassLoader}}, 
> {{ExecutorClassLoader.getParent()}} will return the system default 
> classloader.
> Thus, when {{ExecutorClassLoader}} tries to load a class, it will first 
> attempt to load it through the system default classloader, and this will 
> break the {{spark.executor.userClassPathFirst}} behavior.
> A simple fix would be to define {{ExecutorClassLoader}} as:
> {noformat}
> class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: 
> ClassLoader,
> userClassPathFirst: Boolean) extends ClassLoader(parent) with Logging
> {noformat}






[jira] [Updated] (SPARK-21377) Jars specified with --jars or --packages are not added into AM's system classpath

2017-07-12 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-21377:

Summary: Jars specified with --jars or --packages are not added into AM's 
system classpath  (was: Add a new configuration to extend AM classpath in yarn 
client mode)

> Jars specified with --jars or --packages are not added into AM's system 
> classpath
> -
>
> Key: SPARK-21377
> URL: https://issues.apache.org/jira/browse/SPARK-21377
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> In this issue we have a long-running Spark application with secure HBase, 
> which requires {{HBaseCredentialProvider}} to get tokens periodically. We 
> specify HBase-related jars with {{\--packages}}, but these dependencies are 
> not added to the AM classpath, so when {{HBaseCredentialProvider}} tries to 
> initialize HBase connections to get tokens, it fails.
> Currently, because jars specified with {{\--jars}} or {{\--packages}} are not 
> added to the AM classpath, the only way to extend the AM classpath is to use 
> "spark.driver.extraClassPath", which is supposed to be used in yarn cluster mode.
> So here we should figure out a solution, either to put these dependencies on the 
> AM classpath or to extend the AM classpath with the correct configuration.






[jira] [Commented] (SPARK-21377) Add a new configuration to extend AM classpath in yarn client mode

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084929#comment-16084929
 ] 

Apache Spark commented on SPARK-21377:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18616

> Add a new configuration to extend AM classpath in yarn client mode
> --
>
> Key: SPARK-21377
> URL: https://issues.apache.org/jira/browse/SPARK-21377
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> In this issue we have a long-running Spark application with secure HBase, 
> which requires {{HBaseCredentialProvider}} to get tokens periodically. We 
> specify HBase-related jars with {{\--packages}}, but these dependencies are 
> not added to the AM classpath, so when {{HBaseCredentialProvider}} tries to 
> initialize HBase connections to get tokens, it fails.
> Currently, because jars specified with {{\--jars}} or {{\--packages}} are not 
> added to the AM classpath, the only way to extend the AM classpath is to use 
> "spark.driver.extraClassPath", which is supposed to be used in yarn cluster mode.
> So here we should figure out a solution, either to put these dependencies on the 
> AM classpath or to extend the AM classpath with the correct configuration.






[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-07-12 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084886#comment-16084886
 ] 

Ruslan Dautkhanov commented on SPARK-13534:
---

[~bryanc], thanks for the feedback. We sometimes have issues with caching 
dataframes in Spark, so we wanted to see whether Arrow could be a better fit in 
PySpark than Pickle / cPickle for caching dataframes. 

Thanks for the link - I will check that. On a separate note, can Arrow's batched 
columnar storage still be iterated over for reads? For non-batched writes 
the PySpark serializer could fall back to a non-Arrow format. So it might be interesting 
to explore whether two serializers can be active at the same time - 
batched Arrow with a fall-back to cPickle if necessary? 
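
For reference, a minimal sketch of the Arrow-backed {{toPandas()}} path this issue adds, assuming the spark.sql.execution.arrow.enabled flag that ships with the 2.3.0 fix version (verify the exact configuration name against the release you run; pyarrow must be installed on the driver):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

df = spark.range(1 << 20).selectExpr("id", "id * 2 AS doubled")

# Assumption: this is the flag added for SPARK-13534; with it on, toPandas()
# transfers Arrow record batches instead of pickled Row objects.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf_arrow = df.toPandas()

# With it off, the old pickle-based serializer path is used.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf_pickle = df.toPandas()

assert len(pdf_arrow) == len(pdf_pickle)
{code}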

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds edited comment on SPARK-21392 at 7/12/17 10:27 PM:
---

I've simplified the example a little more and also found that limiting the query 
size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it 
fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark, with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}
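
For context, a minimal sketch of the kind of partitioned JDBC load that get_sparkSQLContextWithTables (the reporter's custom helper, not a Spark API) presumably performs, under the assumptions stated above (partition column eid, 10 partitions, bounds from min/max(eid)); the connection settings are placeholders:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-outcomes").getOrCreate()

# Placeholder connection settings; the reporter's helper is custom code.
url = "jdbc:postgresql://dbhost:5432/mydb"
props = {"user": "spark", "password": "secret",
         "driver": "org.postgresql.Driver"}

# Partitioned read over eid, mirroring index/index_min/index_max above.
outcomes = spark.read.jdbc(
    url, table="outcomes",
    column="eid", lowerBound=1, upperBound=500000, numPartitions=10,
    properties=props)

outcomes.createOrReplaceTempView("outcomes")
{code}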


was (Author: stuartreynolds):
I've simplified the example a little more and also found that limiting the query 
size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it 
fails.

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for Parquet says the format is self-describing, and the 
> full schema was available when the Parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the SQL query. The whole 
> selected table is 500k rows with an int and a short column.
> Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
> claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)






[jira] [Created] (SPARK-21393) spark (pyspark) crashes unpredictably when using show() or toPandas()

2017-07-12 Thread Zahra (JIRA)
Zahra created SPARK-21393:
-

 Summary: spark (pyspark) crashes unpredictably when using show() 
or toPandas()
 Key: SPARK-21393
 URL: https://issues.apache.org/jira/browse/SPARK-21393
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1, 2.2.1
 Environment: Windows 10
python 2.7
Reporter: Zahra


I unpredictably run into this error when using either 
`pyspark.sql.DataFrame.show()` or `pyspark.sql.DataFrame.toPandas()`.

The error log starts with (truncated):
{noformat}
17/07/12 16:03:09 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"apply_47$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.immutable.Set set;
/* 009 */   private scala.collection.immutable.Set set1;
/* 010 */   private scala.collection.immutable.Set set2;
/* 011 */   private scala.collection.immutable.Set set3;
/* 012 */   private UTF8String.IntWrapper wrapper;
/* 013 */   private UTF8String.IntWrapper wrapper1;
/* 014 */   private scala.collection.immutable.Set set4;
/* 015 */   private UTF8String.IntWrapper wrapper2;
/* 016 */   private UTF8String.IntWrapper wrapper3;
/* 017 */   private scala.collection.immutable.Set set5;
/* 018 */   private scala.collection.immutable.Set set6;
/* 019 */   private scala.collection.immutable.Set set7;
/* 020 */   private UTF8String.IntWrapper wrapper4;
/* 021 */   private UTF8String.IntWrapper wrapper5;
/* 022 */   private scala.collection.immutable.Set set8;
/* 023 */   private UTF8String.IntWrapper wrapper6;
/* 024 */   private UTF8String.IntWrapper wrapper7;
/* 025 */   private scala.collection.immutable.Set set9;
/* 026 */   private scala.collection.immutable.Set set10;
/* 027 */   private scala.collection.immutable.Set set11;
/* 028 */   private UTF8String.IntWrapper wrapper8;
/* 029 */   private UTF8String.IntWrapper wrapper9;
/* 030 */   private scala.collection.immutable.Set set12;
/* 031 */   private UTF8String.IntWrapper wrapper10;
/* 032 */   private UTF8String.IntWrapper wrapper11;
/* 033 */   private scala.collection.immutable.Set set13;
/* 034 */   private scala.collection.immutable.Set set14;
/* 035 */   private scala.collection.immutable.Set set15;
/* 036 */   private UTF8String.IntWrapper wrapper12;
/* 037 */   private UTF8String.IntWrapper wrapper13;
/* 038 */   private scala.collection.immutable.Set set16;
/* 039 */   private UTF8String.IntWrapper wrapper14;
/* 040 */   private UTF8String.IntWrapper wrapper15;
/* 041 */   private scala.collection.immutable.Set set17;
/* 042 */   private scala.collection.immutable.Set set18;
/* 043 */   private scala.collection.immutable.Set set19;
/* 044 */   private UTF8String.IntWrapper wrapper16;
/* 045 */   private UTF8String.IntWrapper wrapper17;
/* 046 */   private scala.collection.immutable.Set set20;
/* 047 */   private UTF8String.IntWrapper wrapper18;
/* 048 */   private UTF8String.IntWrapper wrapper19;
/* 049 */   private scala.collection.immutable.Set set21;
/* 050 */   private scala.collection.immutable.Set set22;
/* 051 */   private scala.collection.immutable.Set set23;
/* 052 */   private UTF8String.IntWrapper wrapper20;
/* 053 */   private UTF8String.IntWrapper wrapper21;
/* 054 */   private scala.collection.immutable.Set set24;
/* 055 */   private UTF8String.IntWrapper wrapper22;
/* 056 */   private UTF8String.IntWrapper wrapper23;
/* 057 */   private scala.collection.immutable.Set set25;
/* 058 */   private scala.collection.immutable.Set set26;
/* 059 */   private scala.collection.immutable.Set set27;
/* 060 */   private UTF8String.IntWrapper wrapper24;
/* 061 */   private UTF8String.IntWrapper wrapper25;
/* 062 */   private scala.collection.immutable.Set set28;
/* 063 */   private UTF8String.IntWrapper wrapper26;
/* 064 */   private UTF8String.IntWrapper wrapper27;
/* 065 */   private scala.collection.immutable.Set set29;
/* 066 */   private scala.collection.immutable.Set set30;
/* 067 */   private scala.collection.immutable.Set set31;
/* 068 */   private UTF8String.IntWrapper wrapper28;
/* 069 */   private UTF8String.IntWrapper wrapper29;
/* 070 */   private scala.collection.immutable.Set set32;
/* 071 */   private UTF8String.IntWrapper wrapper30;
/* 072 */   private UTF8String.IntWrapper wrapper31;
/* 073 */   private scala.collection.immutable.Set set33;
/* 074 */   private 

[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds commented on SPARK-21392:
-

I've simplified the example a little more and also found that limiting the query 
size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it 
fails.
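
One possible diagnostic step / workaround sketch, not a confirmed fix: pass the schema explicitly via DataFrameReader.schema so the Parquet read no longer depends on schema inference. Whether this sidesteps the failure reported here is untested; the snippet assumes the sqlc and response objects from the code above.

{code:python}
from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

# Schema matching the one printed by rdd.schema above.
schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("mi_or_chd_5", ShortType(), True),
])

# .schema(...) skips the automatic inference that raises the AnalysisException.
rdd2 = sqlc.read.schema(schema).parquet(response)
rdd2.show(5)
{code}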

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for Parquet says the format is self-describing, and the 
> full schema was available when the Parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the SQL query. The whole 
> selected table is 500k rows with an int and a short column.
> Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
> claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|226|   null|
#|442|   null|
#|978|  0|
#|851|  0|
#|428|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full 
schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole 
selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is against 2.1.1.)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|216|   null|
#|431|   null|
#|978|  0|
#|852|  0|
#|418|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|226|   null|
> #|442|   null|
> #|978|  0|
> #|851|  0|
> #|428|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Commented] (SPARK-12559) Cluster mode doesn't work with --packages

2017-07-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084776#comment-16084776
 ] 

Marcelo Vanzin commented on SPARK-12559:


[~skonto], could you look at whether the same approach in your patch works in 
standalone mode? Otherwise please clone this bug and set the clone's component 
to "Mesos", so we don't lose track that this still won't work for standalone 
mode.

> Cluster mode doesn't work with --packages
> -
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> From the mailing list:
> {quote}
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> to fix this we either (1) do upload jars, or (2) just run the packages code 
> on the driver side. I slightly prefer (2) because it's simpler.






[jira] [Comment Edited] (SPARK-12559) Cluster mode doesn't work with --packages

2017-07-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084776#comment-16084776
 ] 

Marcelo Vanzin edited comment on SPARK-12559 at 7/12/17 9:49 PM:
-

[~skonto], could you look at whether the same approach in your patch works in 
standalone mode? Otherwise please clone this bug and set the clone's component 
to "Mesos", so we don't lose track that this still won't work for standalone 
mode.


was (Author: vanzin):
[~skonto], could you look at whether the same approach works in your patch 
works in standalone mode? Otherwise please clone this bug and set the clone's 
component to "Mesos", so we don't lose track that this still won't work for 
standalone mode.

> Cluster mode doesn't work with --packages
> -
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> From the mailing list:
> {quote}
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> to fix this we either (1) do upload jars, or (2) just run the packages code 
> on the driver side. I slightly prefer (2) because it's simpler.






[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds edited comment on SPARK-21392 at 7/12/17 10:27 PM:
---

I've simplified the example a little more and also found that limiting the query 
to 100 rows succeeds, whereas selecting all 500k rows * 2 columns fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark, with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}
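
For context, a sketch of how a partitioned JDBC load like the one described 
above is usually wired up in PySpark; the connection URL is an illustrative 
assumption, and index_min/index_max stand for the precomputed bounds mentioned 
above.

{code:none}
# Partitioned JDBC read: Spark splits "eid" into 10 ranges between the bounds.
outcomes = (sqlc.read.format("jdbc")
            .option("url", "jdbc:postgresql://host/db")   # assumed connection URL
            .option("dbtable", "outcomes")
            .option("partitionColumn", "eid")
            .option("lowerBound", index_min)
            .option("upperBound", index_max)
            .option("numPartitions", 10)
            .load())
{code}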


was (Author: stuartreynolds):
I've simplified the example a little more and also found the limiting the query 
size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it 
fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from postgres into spark, with 10 partitions with:
{code:non}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|216|   null|
#|431|   null|
#|978|  0|
#|852|  0|
#|418|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full 
schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole 
selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is against 2.1.1.)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


Summary: Unable to infer schema when loading large Parquet file  (was: 
Unable to infer schema when loading Parquet file)

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084821#comment-16084821
 ] 

Hyukjin Kwon commented on SPARK-21392:
--

I think this is unrelated to that JIRA ^ too.

> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Comment Edited] (SPARK-21393) spark (pyspark) crashes unpredictably when using show() or toPandas()

2017-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084689#comment-16084689
 ] 

Hyukjin Kwon edited comment on SPARK-21393 at 7/12/17 10:12 PM:


Would you mind sharing your code? I want to reproduce this issue, but it looks 
like I can't given the information here.


was (Author: hyukjin.kwon):
Would you mind sharing your codes? I want to reproduce this issue but looks I 
can given the information here.

> spark (pyspark) crashes unpredictably when using show() or toPandas()
> -
>
> Key: SPARK-21393
> URL: https://issues.apache.org/jira/browse/SPARK-21393
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Windows 10
> python 2.7
>Reporter: Zahra
>
> unpredictbly run into this error either when using 
> `pyspark.sql.DataFrame.show()` or `pyspark.sql.DataFrame.toPandas()`
> error log starts with  (truncated) :
> {noformat}
> 17/07/12 16:03:09 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply_47$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private scala.collection.immutable.Set set;
> /* 009 */   private scala.collection.immutable.Set set1;
> /* 010 */   private scala.collection.immutable.Set set2;
> /* 011 */   private scala.collection.immutable.Set set3;
> /* 012 */   private UTF8String.IntWrapper wrapper;
> /* 013 */   private UTF8String.IntWrapper wrapper1;
> /* 014 */   private scala.collection.immutable.Set set4;
> /* 015 */   private UTF8String.IntWrapper wrapper2;
> /* 016 */   private UTF8String.IntWrapper wrapper3;
> /* 017 */   private scala.collection.immutable.Set set5;
> /* 018 */   private scala.collection.immutable.Set set6;
> /* 019 */   private scala.collection.immutable.Set set7;
> /* 020 */   private UTF8String.IntWrapper wrapper4;
> /* 021 */   private UTF8String.IntWrapper wrapper5;
> /* 022 */   private scala.collection.immutable.Set set8;
> /* 023 */   private UTF8String.IntWrapper wrapper6;
> /* 024 */   private UTF8String.IntWrapper wrapper7;
> /* 025 */   private scala.collection.immutable.Set set9;
> /* 026 */   private scala.collection.immutable.Set set10;
> /* 027 */   private scala.collection.immutable.Set set11;
> /* 028 */   private UTF8String.IntWrapper wrapper8;
> /* 029 */   private UTF8String.IntWrapper wrapper9;
> /* 030 */   private scala.collection.immutable.Set set12;
> /* 031 */   private UTF8String.IntWrapper wrapper10;
> /* 032 */   private UTF8String.IntWrapper wrapper11;
> /* 033 */   private scala.collection.immutable.Set set13;
> /* 034 */   private scala.collection.immutable.Set set14;
> /* 035 */   private scala.collection.immutable.Set set15;
> /* 036 */   private UTF8String.IntWrapper wrapper12;
> /* 037 */   private UTF8String.IntWrapper wrapper13;
> /* 038 */   private scala.collection.immutable.Set set16;
> /* 039 */   private UTF8String.IntWrapper wrapper14;
> /* 040 */   private UTF8String.IntWrapper wrapper15;
> /* 041 */   private scala.collection.immutable.Set set17;
> /* 042 */   private scala.collection.immutable.Set set18;
> /* 043 */   private scala.collection.immutable.Set set19;
> /* 044 */   private UTF8String.IntWrapper wrapper16;
> /* 045 */   private UTF8String.IntWrapper wrapper17;
> /* 046 */   private scala.collection.immutable.Set set20;
> /* 047 */   private UTF8String.IntWrapper wrapper18;
> /* 048 */   private UTF8String.IntWrapper wrapper19;
> /* 049 */   private scala.collection.immutable.Set set21;
> /* 050 */   private scala.collection.immutable.Set set22;
> /* 051 */   private scala.collection.immutable.Set set23;
> /* 052 */   private UTF8String.IntWrapper wrapper20;
> /* 053 */   private UTF8String.IntWrapper wrapper21;
> /* 054 */   private scala.collection.immutable.Set set24;
> /* 055 */   private UTF8String.IntWrapper wrapper22;
> /* 056 */   private UTF8String.IntWrapper wrapper23;
> /* 057 */   private scala.collection.immutable.Set set25;
> /* 058 */   private scala.collection.immutable.Set set26;
> /* 059 */   private scala.collection.immutable.Set set27;
> /* 060 */   private UTF8String.IntWrapper wrapper24;
> /* 061 */   private UTF8String.IntWrapper wrapper25;
> /* 062 */   private 

[jira] [Assigned] (SPARK-21394) Reviving broken callable objects in UDF in PySpark

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21394:


Assignee: Apache Spark

> Reviving broken callable objects in UDF in PySpark
> --
>
> Key: SPARK-21394
> URL: https://issues.apache.org/jira/browse/SPARK-21394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> After SPARK-19161, we happened to break callable objects as UDFs in Python as 
> below:
> {code}
> >>> from pyspark.sql import functions
> >>> class F(object):
> ... def __call__(self, x):
> ... return x
> ...
> >>> foo = F()
> >>> foo(1)
> 1
> >>> udf = functions.udf(foo)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
> return _udf(f=f, returnType=returnType)
>   File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
> return udf_obj._wrapped()
>   File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
> @functools.wraps(self.func)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py",
>  line 33, in update_wrapper
> setattr(wrapper, attr, getattr(wrapped, attr))
> AttributeError: F instance has no attribute '__name__'
> {code}
> Note that this works in Spark 2.1 as below:
> {code}
> >>> from pyspark.sql import functions
> >>> class F(object):
> ... def __call__(self, x):
> ... return x
> ...
> >>> foo = F()
> >>> foo(1)
> 1
> >>> udf = functions.udf(foo)
> >>> spark.range(1).select(udf("id")).show()
> +-+
> |F(id)|
> +-+
> |0|
> +-+
> {code}






[jira] [Commented] (SPARK-21394) Reviving broken callable objects in UDF in PySpark

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084787#comment-16084787
 ] 

Apache Spark commented on SPARK-21394:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18615

> Reviving broken callable objects in UDF in PySpark
> --
>
> Key: SPARK-21394
> URL: https://issues.apache.org/jira/browse/SPARK-21394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>
> After SPARK-19161, we happened to break callable objects as UDFs in Python as 
> below:
> {code}
> >>> from pyspark.sql import functions
> >>> class F(object):
> ... def __call__(self, x):
> ... return x
> ...
> >>> foo = F()
> >>> foo(1)
> 1
> >>> udf = functions.udf(foo)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
> return _udf(f=f, returnType=returnType)
>   File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
> return udf_obj._wrapped()
>   File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
> @functools.wraps(self.func)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py",
>  line 33, in update_wrapper
> setattr(wrapper, attr, getattr(wrapped, attr))
> AttributeError: F instance has no attribute '__name__'
> {code}
> Note that this works in Spark 2.1 as below:
> {code}
> >>> from pyspark.sql import functions
> >>> class F(object):
> ... def __call__(self, x):
> ... return x
> ...
> >>> foo = F()
> >>> foo(1)
> 1
> >>> udf = functions.udf(foo)
> >>> spark.range(1).select(udf("id")).show()
> +-+
> |F(id)|
> +-+
> |0|
> +-+
> {code}






[jira] [Assigned] (SPARK-21394) Reviving broken callable objects in UDF in PySpark

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21394:


Assignee: (was: Apache Spark)

> Reviving broken callable objects in UDF in PySpark
> --
>
> Key: SPARK-21394
> URL: https://issues.apache.org/jira/browse/SPARK-21394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>
> After SPARK-19161, we happened to break callable objects as UDFs in Python as 
> below:
> {code}
> >>> from pyspark.sql import functions
> >>> class F(object):
> ... def __call__(self, x):
> ... return x
> ...
> >>> foo = F()
> >>> foo(1)
> 1
> >>> udf = functions.udf(foo)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
> return _udf(f=f, returnType=returnType)
>   File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
> return udf_obj._wrapped()
>   File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
> @functools.wraps(self.func)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py",
>  line 33, in update_wrapper
> setattr(wrapper, attr, getattr(wrapped, attr))
> AttributeError: F instance has no attribute '__name__'
> {code}
> Note that this works in Spark 2.1 as below:
> {code}
> >>> from pyspark.sql import functions
> >>> class F(object):
> ... def __call__(self, x):
> ... return x
> ...
> >>> foo = F()
> >>> foo(1)
> 1
> >>> udf = functions.udf(foo)
> >>> spark.range(1).select(udf("id")).show()
> +-+
> |F(id)|
> +-+
> |0|
> +-+
> {code}






[jira] [Commented] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled

2017-07-12 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084764#comment-16084764
 ] 

Shixiong Zhu commented on SPARK-21374:
--

Yeah, org.apache.spark.deploy.SparkHadoopUtil.globPath uses the wrong Hadoop 
configuration. Feel free to submit a PR to fix it.

For now, as a workaround, you can use the following code to set your keys:

{code}
val conf = org.apache.spark.deploy.SparkHadoopUtil.get.conf
conf.set(...)
{code}
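
For PySpark users, a rough analogue of the workaround above is sketched below. 
It reaches the same configuration object through the py4j gateway, which is 
internal API, so the exact access path is an assumption rather than a supported 
interface.

{code:none}
# Hedged PySpark sketch of the workaround; sc is an active SparkContext.
# Accessing SparkHadoopUtil through sc._jvm is internal/unsupported (assumption).
hadoop_conf = sc._jvm.org.apache.spark.deploy.SparkHadoopUtil.get().conf()
hadoop_conf.set("fs.s3.awsAccessKeyId", "something")
hadoop_conf.set("fs.s3.awsSecretAccessKey", "something else")
{code}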

> Reading globbed paths from S3 into DF doesn't work if filesystem caching is 
> disabled
> 
>
> Key: SPARK-21374
> URL: https://issues.apache.org/jira/browse/SPARK-21374
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Andrey Taptunov
>
> *Motivation:*
> In my case I want to disable filesystem cache to be able to change S3's 
> access key and secret key on the fly to read from buckets with different 
> permissions. This works perfectly fine for RDDs but doesn't work for DFs.
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> object SimpleApp {
>   def main(args: Array[String]) {
> val awsAccessKeyId = "something"
> val awsSecretKey = "something else"
> val conf = new SparkConf().setAppName("Simple 
> Application").setMaster("local[*]")
> val sc = new SparkContext(conf)
> sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
> sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
> sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache",true)
> 
> sc.hadoopConfiguration.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
> sc.hadoopConfiguration.set("fs.s3.buffer.dir","/tmp")
> val spark = SparkSession.builder().config(conf).getOrCreate()
> val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
> val rddGlob = sc.textFile("s3://bucket/*").count // ok
> val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count 
> // ok
> 
> val dfGlob = spark.read.format("csv").load("s3://bucket/*").count 
> // IllegalArgumentExcepton. AWS Access Key ID and Secret Access Key must 
> be specified as the username or password (respectively)
> // of a s3 URL, or by setting the fs.s3.awsAccessKeyId or 
> fs.s3.awsSecretAccessKey properties (respectively).
>
> sc.stop()
>   }
> }
> {code}






[jira] [Comment Edited] (SPARK-21378) Spark Poll timeout when specific offsets are passed

2017-07-12 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084772#comment-16084772
 ] 

Shixiong Zhu edited comment on SPARK-21378 at 7/12/17 9:49 PM:
---

bq. Digging deeper shows that there's an assert statement such that if no 
records are returned (which is a valid case) then a failure will happen.

That's actually not a valid case. CachedKafkaConsumer.scala uses the offset 
range generated in the driver, so the records are supposed to be in Kafka. If 
they are not, it means either a timeout or missing data. If it is just a 
timeout, you can increase "spark.streaming.kafka.consumer.poll.ms" (available 
since Spark 2.1.0).




was (Author: zsxwing):
bq. Digging deeper shows that there's an assert statement such that if no 
records are returned (which is a valid case) then a failure will happen.

That's actually not a valid case. CachedKafkaConsumer.scala uses the offset 
range generated in the driver, so the records are supposed to be in Kafka. If 
not, then it means timeout, or the data is missing. If it's just because of 
timeout, you can increase "spark.streaming.kafka.consumer.poll.ms".



> Spark Poll timeout when specific offsets are passed
> ---
>
> Key: SPARK-21378
> URL: https://issues.apache.org/jira/browse/SPARK-21378
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Ambud Sharma
>
> Kafka direct stream fails with poll timeout:
> {code:java}
> JavaInputDStream> stream = 
> KafkaUtils.createDirectStream(ssc,
>   LocationStrategies.PreferConsistent(),
>   ConsumerStrategies. String>Subscribe(topicsCollection, kafkaParams, fromOffsets));
> {code}
> Digging deeper shows that there's an assert statement such that if no records 
> are returned (which is a valid case) then a failure will happen.
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L75
> This solution: https://issues.apache.org/jira/browse/SPARK-19275 keeps 
> getting "Added jobs for time" and eventually leads to "Failed to get records 
> for spark-x after polling for 3000"; in this case batch size is 3seconds
> We can increase it to an even bigger number which leads to OOM.






[jira] [Commented] (SPARK-21378) Spark Poll timeout when specific offsets are passed

2017-07-12 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084772#comment-16084772
 ] 

Shixiong Zhu commented on SPARK-21378:
--

bq. Digging deeper shows that there's an assert statement such that if no 
records are returned (which is a valid case) then a failure will happen.

That's actually not a valid case. CachedKafkaConsumer.scala uses the offset 
range generated in the driver, so the records are supposed to be in Kafka. If 
they are not, it means either a timeout or missing data. If it is just a 
timeout, you can increase "spark.streaming.kafka.consumer.poll.ms".
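
For reference, a minimal sketch of raising that timeout when building the 
streaming job; the property can equally be passed with --conf on spark-submit, 
and the 30000 ms value is only an illustrative assumption.

{code:none}
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Raise the Kafka consumer poll timeout (milliseconds) before creating the context.
conf = (SparkConf()
        .setAppName("kafka-direct-stream")
        .set("spark.streaming.kafka.consumer.poll.ms", "30000"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 3)  # 3-second batches, matching the report above
{code}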



> Spark Poll timeout when specific offsets are passed
> ---
>
> Key: SPARK-21378
> URL: https://issues.apache.org/jira/browse/SPARK-21378
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Ambud Sharma
>
> Kafka direct stream fails with poll timeout:
> {code:java}
> JavaInputDStream> stream = 
> KafkaUtils.createDirectStream(ssc,
>   LocationStrategies.PreferConsistent(),
>   ConsumerStrategies. String>Subscribe(topicsCollection, kafkaParams, fromOffsets));
> {code}
> Digging deeper shows that there's an assert statement such that if no records 
> are returned (which is a valid case) then a failure will happen.
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L75
> This solution: https://issues.apache.org/jira/browse/SPARK-19275 keeps 
> getting "Added jobs for time" and eventually leads to "Failed to get records 
> for spark-x after polling for 3000"; in this case batch size is 3seconds
> We can increase it to an even bigger number which leads to OOM.






[jira] [Reopened] (SPARK-12559) Cluster mode doesn't work with --packages

2017-07-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-12559:


> Cluster mode doesn't work with --packages
> -
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> From the mailing list:
> {quote}
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> to fix this we either (1) do upload jars, or (2) just run the packages code 
> on the driver side. I slightly prefer (2) because it's simpler.






[jira] [Updated] (SPARK-21378) Spark Poll timeout when specific offsets are passed

2017-07-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21378:
-
Component/s: (was: Spark Core)
 DStreams

> Spark Poll timeout when specific offsets are passed
> ---
>
> Key: SPARK-21378
> URL: https://issues.apache.org/jira/browse/SPARK-21378
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Ambud Sharma
>
> Kafka direct stream fails with poll timeout:
> {code:java}
> JavaInputDStream> stream = 
> KafkaUtils.createDirectStream(ssc,
>   LocationStrategies.PreferConsistent(),
>   ConsumerStrategies. String>Subscribe(topicsCollection, kafkaParams, fromOffsets));
> {code}
> Digging deeper shows that there's an assert statement such that if no records 
> are returned (which is a valid case) then a failure will happen.
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L75
> This solution: https://issues.apache.org/jira/browse/SPARK-19275 keeps 
> getting "Added jobs for time" and eventually leads to "Failed to get records 
> for spark-x after polling for 3000"; in this case batch size is 3seconds
> We can increase it to an even bigger number which leads to OOM.






[jira] [Resolved] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21391.
--
Resolution: Cannot Reproduce

I can't reproduce this against the current master as described in the JIRA, but 
I don't know which JIRA fixed it. I am resolving this, but it would be nicer if 
we could identify the fixing JIRA and backport the change if applicable.

[~kiszk], please fix my resolution here if you know which JIRA fixes this.

> Cannot convert a Seq of Map whose value type is again a seq, into a dataset 
> 
>
> Key: SPARK-21391
> URL: https://issues.apache.org/jira/browse/SPARK-21391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Seen on mac OSX, scala 2.11, java 8
>Reporter: indraneel rao
>
> There is an error while trying to create a dataset from a sequence of Maps, 
> whose values have any kind of collections. Even when they are wrapped in a 
> case class. 
> Eg : The following piece of code throws an error:
>
> {code:java}
> case class Values(values: Seq[Double])
> case class ItemProperties(properties:Map[String,Values])
> val l1 = List(ItemProperties(
>   Map(
> "A1" -> Values(Seq(1.0,2.0)),
> "B1" -> Values(Seq(44.0,55.0))
>   )
> ),
>   ItemProperties(
> Map(
>   "A2" -> Values(Seq(123.0,25.0)),
>   "B2" -> Values(Seq(445.0,35.0))
> )
>   )
> )
> l1.toDS().show()
> {code}
> Heres the error:
> 17/07/12 21:59:35 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 65, Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an 
> rvalue
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private boolean resultIsNull;
> /* 009 */   private java.lang.String argValue;
> /* 010 */   private Object[] values;
> /* 011 */   private boolean resultIsNull1;
> /* 012 */   private scala.collection.Seq argValue1;
> /* 013 */   private boolean isNull11;
> /* 014 */   private boolean value11;
> /* 015 */   private boolean isNull12;
> /* 016 */   private InternalRow value12;
> /* 017 */   private boolean isNull13;
> /* 018 */   private InternalRow value13;
> /* 019 */   private UnsafeRow result;
> /* 020 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter;
> /* 023 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter1;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter2;
> /* 026 */
> /* 027 */   public SpecificUnsafeProjection(Object[] references) {
> /* 028 */ this.references = references;
> /* 029 */
> /* 030 */
> /* 031 */ this.values = null;
> /* 032 */
> /* 033 */
> /* 034 */ isNull11 = false;
> /* 035 */ value11 = false;
> /* 036 */ isNull12 = false;
> /* 037 */ value12 = null;
> /* 038 */ isNull13 = false;
> /* 039 */ value13 = null;
> /* 040 */ result = new UnsafeRow(1);
> /* 041 */ this.holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
> /* 042 */ this.rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 043 */ this.arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 044 */ this.arrayWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 045 */ this.rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 046 */ this.arrayWriter2 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 047 */
> /* 048 */   }
> /* 049 */
> /* 050 */   public void initialize(int partitionIndex) {
> /* 051 */
> /* 052 */   }
> /* 053 */
> /* 054 */
> /* 055 */   private void evalIfTrueExpr(InternalRow i) {
> /* 056 */ final InternalRow value7 = null;
> /* 057 */ isNull12 = true;
> /* 058 */ value12 = value7;
> /* 059 */   }
> /* 060 */
> /* 061 */
> /* 062 */   private void evalIfCondExpr(InternalRow i) {
> /* 

[jira] [Created] (SPARK-21394) Reviving broken callable objects in UDF in PySpark

2017-07-12 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21394:


 Summary: Reviving broken callable objects in UDF in PySpark
 Key: SPARK-21394
 URL: https://issues.apache.org/jira/browse/SPARK-21394
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0, 2.3.0
Reporter: Hyukjin Kwon


After SPARK-19161, we happened to break callable objects as UDFs in Python as 
below:

{code}
>>> from pyspark.sql import functions
>>> class F(object):
... def __call__(self, x):
... return x
...
>>> foo = F()
>>> foo(1)
1
>>> udf = functions.udf(foo)
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
return _udf(f=f, returnType=returnType)
  File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
return udf_obj._wrapped()
  File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
@functools.wraps(self.func)
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py",
 line 33, in update_wrapper
setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: F instance has no attribute '__name__'
{code}


Note that this works in Spark 2.1 as below:

{code}
>>> from pyspark.sql import functions
>>> class F(object):
... def __call__(self, x):
... return x
...
>>> foo = F()
>>> foo(1)
1
>>> udf = functions.udf(foo)
>>> spark.range(1).select(udf("id")).show()
+-+
|F(id)|
+-+
|0|
+-+
{code}
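
Until a fix lands, a hedged interim sketch (not part of the original report): 
wrapping the callable object in a plain function gives functools.wraps the 
__name__ it expects, which approximates the 2.1 behaviour on 2.2. It assumes the 
same F/foo definitions as above.

{code:none}
>>> from pyspark.sql import functions
>>> def foo_udf(x):
...     return foo(x)  # delegate to the callable object defined above
...
>>> udf = functions.udf(foo_udf)
>>> spark.range(1).select(udf("id")).show()
{code}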






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084697#comment-16084697
 ] 

Hyukjin Kwon commented on SPARK-21392:
--

Would you mind running {{outcome.show}} and attaching the result here?

> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Updated] (SPARK-21393) spark (pyspark) crashes unpredictably when using show() or toPandas()

2017-07-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21393:
-
Affects Version/s: (was: 2.2.1)

> spark (pyspark) crashes unpredictably when using show() or toPandas()
> -
>
> Key: SPARK-21393
> URL: https://issues.apache.org/jira/browse/SPARK-21393
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Windows 10
> python 2.7
>Reporter: Zahra
>
> unpredictbly run into this error either when using 
> `pyspark.sql.DataFrame.show()` or `pyspark.sql.DataFrame.toPandas()`
> error log starts with  (truncated) :
> {noformat}
> 17/07/12 16:03:09 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply_47$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private scala.collection.immutable.Set set;
> /* 009 */   private scala.collection.immutable.Set set1;
> /* 010 */   private scala.collection.immutable.Set set2;
> /* 011 */   private scala.collection.immutable.Set set3;
> /* 012 */   private UTF8String.IntWrapper wrapper;
> /* 013 */   private UTF8String.IntWrapper wrapper1;
> /* 014 */   private scala.collection.immutable.Set set4;
> /* 015 */   private UTF8String.IntWrapper wrapper2;
> /* 016 */   private UTF8String.IntWrapper wrapper3;
> /* 017 */   private scala.collection.immutable.Set set5;
> /* 018 */   private scala.collection.immutable.Set set6;
> /* 019 */   private scala.collection.immutable.Set set7;
> /* 020 */   private UTF8String.IntWrapper wrapper4;
> /* 021 */   private UTF8String.IntWrapper wrapper5;
> /* 022 */   private scala.collection.immutable.Set set8;
> /* 023 */   private UTF8String.IntWrapper wrapper6;
> /* 024 */   private UTF8String.IntWrapper wrapper7;
> /* 025 */   private scala.collection.immutable.Set set9;
> /* 026 */   private scala.collection.immutable.Set set10;
> /* 027 */   private scala.collection.immutable.Set set11;
> /* 028 */   private UTF8String.IntWrapper wrapper8;
> /* 029 */   private UTF8String.IntWrapper wrapper9;
> /* 030 */   private scala.collection.immutable.Set set12;
> /* 031 */   private UTF8String.IntWrapper wrapper10;
> /* 032 */   private UTF8String.IntWrapper wrapper11;
> /* 033 */   private scala.collection.immutable.Set set13;
> /* 034 */   private scala.collection.immutable.Set set14;
> /* 035 */   private scala.collection.immutable.Set set15;
> /* 036 */   private UTF8String.IntWrapper wrapper12;
> /* 037 */   private UTF8String.IntWrapper wrapper13;
> /* 038 */   private scala.collection.immutable.Set set16;
> /* 039 */   private UTF8String.IntWrapper wrapper14;
> /* 040 */   private UTF8String.IntWrapper wrapper15;
> /* 041 */   private scala.collection.immutable.Set set17;
> /* 042 */   private scala.collection.immutable.Set set18;
> /* 043 */   private scala.collection.immutable.Set set19;
> /* 044 */   private UTF8String.IntWrapper wrapper16;
> /* 045 */   private UTF8String.IntWrapper wrapper17;
> /* 046 */   private scala.collection.immutable.Set set20;
> /* 047 */   private UTF8String.IntWrapper wrapper18;
> /* 048 */   private UTF8String.IntWrapper wrapper19;
> /* 049 */   private scala.collection.immutable.Set set21;
> /* 050 */   private scala.collection.immutable.Set set22;
> /* 051 */   private scala.collection.immutable.Set set23;
> /* 052 */   private UTF8String.IntWrapper wrapper20;
> /* 053 */   private UTF8String.IntWrapper wrapper21;
> /* 054 */   private scala.collection.immutable.Set set24;
> /* 055 */   private UTF8String.IntWrapper wrapper22;
> /* 056 */   private UTF8String.IntWrapper wrapper23;
> /* 057 */   private scala.collection.immutable.Set set25;
> /* 058 */   private scala.collection.immutable.Set set26;
> /* 059 */   private scala.collection.immutable.Set set27;
> /* 060 */   private UTF8String.IntWrapper wrapper24;
> /* 061 */   private UTF8String.IntWrapper wrapper25;
> /* 062 */   private scala.collection.immutable.Set set28;
> /* 063 */   private UTF8String.IntWrapper wrapper26;
> /* 064 */   private UTF8String.IntWrapper wrapper27;
> /* 065 */   private scala.collection.immutable.Set set29;
> /* 066 */   private scala.collection.immutable.Set set30;
> /* 067 */   private 

[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084707#comment-16084707
 ] 

Dongjoon Hyun commented on SPARK-21392:
---

This seems to be a different issue. SPARK-16975 was about reading Parquet files 
written by Spark 1.6.x; the content of that fix was the following.
{code}
[SPARK-16975][SQL] Column-partition path starting '_' should be handled 
correctly
{code}
Here, I think you are using Spark 2.1.1 to write the Parquet file.
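
One quick check that may help narrow this down (a diagnostic sketch, assuming 
response names a local output path as in the snippets above): list what actually 
landed in the directory. "Unable to infer schema" usually means the reader found 
no readable Parquet footers there, for example when only a _SUCCESS marker or 
stray underscore-prefixed files are present.

{code:none}
import os

# Expect part-*.parquet files alongside _SUCCESS; an empty or marker-only
# directory would explain the schema-inference failure.
print(sorted(os.listdir(response)))
{code}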

> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Commented] (SPARK-21393) spark (pyspark) crashes unpredictably when using show() or toPandas()

2017-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084689#comment-16084689
 ] 

Hyukjin Kwon commented on SPARK-21393:
--

Would you mind sharing your code? I want to reproduce this issue, but it looks 
like I can't given the information here.

> spark (pyspark) crashes unpredictably when using show() or toPandas()
> -
>
> Key: SPARK-21393
> URL: https://issues.apache.org/jira/browse/SPARK-21393
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Windows 10
> python 2.7
>Reporter: Zahra
>
> unpredictbly run into this error either when using 
> `pyspark.sql.DataFrame.show()` or `pyspark.sql.DataFrame.toPandas()`
> error log starts with  (truncated) :
> {noformat}
> 17/07/12 16:03:09 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply_47$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private scala.collection.immutable.Set set;
> /* 009 */   private scala.collection.immutable.Set set1;
> /* 010 */   private scala.collection.immutable.Set set2;
> /* 011 */   private scala.collection.immutable.Set set3;
> /* 012 */   private UTF8String.IntWrapper wrapper;
> /* 013 */   private UTF8String.IntWrapper wrapper1;
> /* 014 */   private scala.collection.immutable.Set set4;
> /* 015 */   private UTF8String.IntWrapper wrapper2;
> /* 016 */   private UTF8String.IntWrapper wrapper3;
> /* 017 */   private scala.collection.immutable.Set set5;
> /* 018 */   private scala.collection.immutable.Set set6;
> /* 019 */   private scala.collection.immutable.Set set7;
> /* 020 */   private UTF8String.IntWrapper wrapper4;
> /* 021 */   private UTF8String.IntWrapper wrapper5;
> /* 022 */   private scala.collection.immutable.Set set8;
> /* 023 */   private UTF8String.IntWrapper wrapper6;
> /* 024 */   private UTF8String.IntWrapper wrapper7;
> /* 025 */   private scala.collection.immutable.Set set9;
> /* 026 */   private scala.collection.immutable.Set set10;
> /* 027 */   private scala.collection.immutable.Set set11;
> /* 028 */   private UTF8String.IntWrapper wrapper8;
> /* 029 */   private UTF8String.IntWrapper wrapper9;
> /* 030 */   private scala.collection.immutable.Set set12;
> /* 031 */   private UTF8String.IntWrapper wrapper10;
> /* 032 */   private UTF8String.IntWrapper wrapper11;
> /* 033 */   private scala.collection.immutable.Set set13;
> /* 034 */   private scala.collection.immutable.Set set14;
> /* 035 */   private scala.collection.immutable.Set set15;
> /* 036 */   private UTF8String.IntWrapper wrapper12;
> /* 037 */   private UTF8String.IntWrapper wrapper13;
> /* 038 */   private scala.collection.immutable.Set set16;
> /* 039 */   private UTF8String.IntWrapper wrapper14;
> /* 040 */   private UTF8String.IntWrapper wrapper15;
> /* 041 */   private scala.collection.immutable.Set set17;
> /* 042 */   private scala.collection.immutable.Set set18;
> /* 043 */   private scala.collection.immutable.Set set19;
> /* 044 */   private UTF8String.IntWrapper wrapper16;
> /* 045 */   private UTF8String.IntWrapper wrapper17;
> /* 046 */   private scala.collection.immutable.Set set20;
> /* 047 */   private UTF8String.IntWrapper wrapper18;
> /* 048 */   private UTF8String.IntWrapper wrapper19;
> /* 049 */   private scala.collection.immutable.Set set21;
> /* 050 */   private scala.collection.immutable.Set set22;
> /* 051 */   private scala.collection.immutable.Set set23;
> /* 052 */   private UTF8String.IntWrapper wrapper20;
> /* 053 */   private UTF8String.IntWrapper wrapper21;
> /* 054 */   private scala.collection.immutable.Set set24;
> /* 055 */   private UTF8String.IntWrapper wrapper22;
> /* 056 */   private UTF8String.IntWrapper wrapper23;
> /* 057 */   private scala.collection.immutable.Set set25;
> /* 058 */   private scala.collection.immutable.Set set26;
> /* 059 */   private scala.collection.immutable.Set set27;
> /* 060 */   private UTF8String.IntWrapper wrapper24;
> /* 061 */   private UTF8String.IntWrapper wrapper25;
> /* 062 */   private scala.collection.immutable.Set set28;
> /* 063 */   private UTF8String.IntWrapper wrapper26;
> /* 064 */   private UTF8String.IntWrapper wrapper27;
> /* 065 */   private 

[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084601#comment-16084601
 ] 

Stuart Reynolds commented on SPARK-21392:
-

Done.

I'm simply trying to build a table of two columns ("eid" and whatever the named 
response variable is), save the table as a Parquet file (with the same name as 
the second column), and load it back up.

Saving works but loading fails, complaining it can't infer the schema, even 
though I could print the schema before saving it:
{code:none}
StructType(
  List(
    StructField(eid,IntegerType,true),
    StructField(response,ShortType,true)))
{code}
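
In the meantime, a hedged workaround sketch: the error message suggests supplying 
the schema manually, and since the schema is printed above before the write, it 
can be handed to the reader explicitly. The names below match the report; whether 
this sidesteps the underlying issue has not been verified.
{code:python}
# Hedged workaround sketch: give the reader the schema instead of relying on inference.
# Column names/types are the ones printed above; untested against this particular bug.
# `sqlc` and `response` are the objects/variables from the report's snippet.
from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("response", ShortType(), True),
])

outcome2 = sqlc.read.schema(schema).parquet(response)
{code}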

> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:python}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:python}
   response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{{response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
}}.

But then,
{{
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
}}.

fails with:

{{AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
}}.

in 

{{
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
}}.

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
>response = "mi_or_chd_5"
> colname = "f_1000"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> col = sqlc.sql("""select eid,{colname} as {colname}
> from baseline_denull
> where {colname} IS NOT NULL""".format(colname=colname))
> col.write.parquet(colname, mode="overwrite")
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> >>> print col.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
> {code}
> 
> But then,
> {code:python}
> outcome2 = sqlc.read.parquet(response)  # fail
> col2 = sqlc.read.parquet(colname) # fail
> {code}
> fails with:
> {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:python} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This 

[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:python}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:python}
response = "mi_or_chd_5"
colname = "f123"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:python}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:python} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:python}
response = "mi_or_chd_5"
colname = "f123"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:python}
   response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
> response = "mi_or_chd_5"
> colname = "f123"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> col = sqlc.sql("""select eid,{colname} as {colname}
> from baseline_denull
> where {colname} IS NOT NULL""".format(colname=colname))
> col.write.parquet(colname, mode="overwrite")
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> >>> print col.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
> {code}
> 
> But then,
> {code:python}
> outcome2 = sqlc.read.parquet(response)  # fail
> col2 = sqlc.read.parquet(colname) # fail
> {code}
> fails with:
> {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:python} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 

[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084583#comment-16084583
 ] 

Sean Owen commented on SPARK-21392:
---

Can you format this so it's readable?
I don't understand how "response" can be what appears to be a column name, but 
then also a path to a Parquet file. Are you just mixing up arguments?

> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {{response = "mi_or_chd_5"
> colname = "f_1000"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> col = sqlc.sql("""select eid,{colname} as {colname}
> from baseline_denull
> where {colname} IS NOT NULL""".format(colname=colname))
> col.write.parquet(colname, mode="overwrite")
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> >>> print col.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
> }}.
> 
> But then,
> {{
> outcome2 = sqlc.read.parquet(response)  # fail
> col2 = sqlc.read.parquet(colname) # fail
> }}.
> fails with:
> {{AnalysisException: u'Unable to infer schema for Parquet. It must be 
> specified manually.;'
> }}.
> in 
> {{
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> }}.
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest

2017-07-12 Thread Ajay Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Saini updated SPARK-21221:
---
Comment: was deleted

(was: Note: In order for python persistence of OneVsRest inside a 
CrossValidator/TrainValidationSplit to work this change needs to be merged 
because Python persistence of meta-algorithms relies on the Scala saving 
implementation.)

> CrossValidator and TrainValidationSplit Persist Nested Estimators such as 
> OneVsRest
> ---
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 
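
For reference, a hedged Python sketch of the nested-estimator setup described 
above; parameter values and the save path are illustrative assumptions, not taken 
from this issue.
{code:python}
# Hedged sketch of a CrossValidator whose estimator is OneVsRest, i.e. the case
# where an estimator-valued param is not JSON serializable. Values are made up.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("cv-ovr-persistence-sketch").getOrCreate()

lr = LogisticRegression(maxIter=10)
ovr = OneVsRest(classifier=lr)                 # classifier is an estimator-valued param
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
cv = CrossValidator(estimator=ovr,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(),
                    numFolds=3)

# Persisting the validator is the step the description says currently fails,
# because the nested classifier param cannot be written out as JSON.
if hasattr(cv, "save"):                        # Python persistence exists only on some versions
    cv.save("/tmp/cv_with_ovr")                # hypothetical path
{code}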



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14280) Update change-version.sh and pom.xml to add Scala 2.12 profiles

2017-07-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-14280:
--

Assignee: Josh Rosen

> Update change-version.sh and pom.xml to add Scala 2.12 profiles
> ---
>
> Key: SPARK-14280
> URL: https://issues.apache.org/jira/browse/SPARK-14280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The following instructions will be kept quasi-up-to-date and are the best 
> starting point for building a Spark snapshot with Scala 2.12.0-M4:
> * Check out https://github.com/JoshRosen/spark/tree/build-for-2.12.
> * Install dependencies:
> ** chill: check out https://github.com/twitter/chill/pull/253 and run 
> {{sbt ++2.12.0-M4 publishLocal}}
> * Run {{./dev/change-scala-version.sh 2.12.0-M4}}
> * To compile Spark, run {{build/sbt -Dscala-2.12}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14280) Update change-version.sh and pom.xml to add Scala 2.12 profiles

2017-07-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-14280:
--

Assignee: (was: Josh Rosen)

> Update change-version.sh and pom.xml to add Scala 2.12 profiles
> ---
>
> Key: SPARK-14280
> URL: https://issues.apache.org/jira/browse/SPARK-14280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> The following instructions will be kept quasi-up-to-date and are the best 
> starting point for building a Spark snapshot with Scala 2.12.0-M4:
> * Check out https://github.com/JoshRosen/spark/tree/build-for-2.12.
> * Install dependencies:
> ** chill: check out https://github.com/twitter/chill/pull/253 and run 
> {{sbt ++2.12.0-M4 publishLocal}}
> * Run {{./dev/change-scala-version.sh 2.12.0-M4}}
> * To compile Spark, run {{build/sbt -Dscala-2.12}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14650) Compile Spark REPL for Scala 2.12

2017-07-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-14650:
--

Assignee: (was: Josh Rosen)

> Compile Spark REPL for Scala 2.12
> -
>
> Key: SPARK-14650
> URL: https://issues.apache.org/jira/browse/SPARK-14650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14438) Cross-publish Breeze for Scala 2.12

2017-07-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14438.

Resolution: Fixed

> Cross-publish Breeze for Scala 2.12
> ---
>
> Key: SPARK-14438
> URL: https://issues.apache.org/jira/browse/SPARK-14438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> Spark relies on Breeze (https://github.com/scalanlp/breeze), so we'll need to 
> cross-publish that for Scala 2.12.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14280) Update change-version.sh and pom.xml to add Scala 2.12 profiles

2017-07-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-14280:
--

Assignee: (was: Josh Rosen)

> Update change-version.sh and pom.xml to add Scala 2.12 profiles
> ---
>
> Key: SPARK-14280
> URL: https://issues.apache.org/jira/browse/SPARK-14280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> The following instructions will be kept quasi-up-to-date and are the best 
> starting point for building a Spark snapshot with Scala 2.12.0-M4:
> * Check out https://github.com/JoshRosen/spark/tree/build-for-2.12.
> * Install dependencies:
> ** chill: check out https://github.com/twitter/chill/pull/253 and run 
> {{sbt ++2.12.0-M4 publishLocal}}
> * Run {{./dev/change-scala-version.sh 2.12.0-M4}}
> * To compile Spark, run {{build/sbt -Dscala-2.12}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14519) Cross-publish Kafka for Scala 2.12

2017-07-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14519.

Resolution: Fixed

> Cross-publish Kafka for Scala 2.12
> --
>
> Key: SPARK-14519
> URL: https://issues.apache.org/jira/browse/SPARK-14519
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> In order to build the streaming Kafka connector, we need to publish Kafka for 
> Scala 2.12.0-M4. Someone should file an issue against the Kafka project and 
> work with their developers to figure out what will block their upgrade / 
> release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084460#comment-16084460
 ] 

Saisai Shao edited comment on SPARK-21376 at 7/12/17 6:31 PM:
--

I'm referring to the o.a.s.deploy.yarn.Client class; it will monitor the YARN 
application and try to delete staging files when the application is finished.


was (Author: jerryshao):
I'm referrring to o.a.s.deploy.yarn.Client this class, it will monitoring yarn 
application and try to delete staging files when application is finished.

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run HDFSWordcount streaming app in yarn-cluster mode  for 25 hrs. 
> After 25 hours, noticing that HDFS Wordcount job is hitting 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084460#comment-16084460
 ] 

Saisai Shao commented on SPARK-21376:
-

I'm referring to the o.a.s.deploy.yarn.Client class; it will monitor the YARN 
application and try to delete staging files when the application is finished.

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run HDFSWordcount streaming app in yarn-cluster mode  for 25 hrs. 
> After 25 hours, noticing that HDFS Wordcount job is hitting 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084458#comment-16084458
 ] 

Thomas Graves commented on SPARK-21376:
---

So you are referring to the org.apache.spark.launcher.SparkLauncher code that 
launches a yarn-cluster-mode job? Or what do you mean by "local yarn launcher 
process"?

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run HDFSWordcount streaming app in yarn-cluster mode  for 25 hrs. 
> After 25 hours, noticing that HDFS Wordcount job is hitting 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest

2017-07-12 Thread Ajay Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Saini updated SPARK-21221:
---
Summary: CrossValidator and TrainValidationSplit Persist Nested Estimators 
such as OneVsRest  (was: CrossValidator and TrainValidationSplit Persist Nested 
Estimators)

> CrossValidator and TrainValidationSplit Persist Nested Estimators such as 
> OneVsRest
> ---
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-07-12 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084451#comment-16084451
 ] 

Ye Zhou commented on SPARK-18085:
-

I want to add my own testing experience with the code from the HEAD of your 
repo. I built and deployed this Spark History Server two weeks ago and pointed 
the log dir at our dev cluster, where around 1.5k Spark application logs are 
produced daily, and we continuously generate load against it via REST calls. It 
works well. Thanks for the work.
I have gone through the code partially and have some minor improvement ideas to 
fit our internal needs, especially when there are more than 50K log files in the 
log dir. Some of them might also be useful upstream.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)
Stuart Reynolds created SPARK-21392:
---

 Summary: Unable to infer schema when loading Parquet file
 Key: SPARK-21392
 URL: https://issues.apache.org/jira/browse/SPARK-21392
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
 Environment: Spark 2.1.1. python 2.7.6

Reporter: Stuart Reynolds


The following boring code works

{{response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
}}.

But then,
{{
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
}}.

fails with:

{{AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
}}.

in 

{{
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
}}.

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-07-12 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084390#comment-16084390
 ] 

Bryan Cutler commented on SPARK-13534:
--

Hi [~tagar], the {{ArrowSerializer}} doesn't quite fit as a drop-in replacement 
because the standard PySpark serializers use iterators over elements and Arrow 
works on batches.  Trying to iterate over the batches to get individual 
elements would probably cancel out any performance gains.  So then you would 
need to operate on the data with an interface like Pandas.  I proposed 
something similar in my comment 
[here|https://issues.apache.org/jira/browse/SPARK-21190?focusedCommentId=16077390=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16077390]
 (some api 
[details|https://gist.github.com/BryanCutler/2d2ae04e81fa96ba4b61dc095726419f]).
 I'd like to hear what your use case is for working with Arrow data, and what 
you'd want to see in Spark to support it.
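
To make the batch-versus-iterator point concrete, here is a hedged sketch using 
plain pyarrow and pandas outside Spark; the data and batch layout are made up for 
illustration.
{code:python}
# Minimal sketch: Arrow hands over columnar record batches, and converting whole
# batches to pandas is the cheap path; pulling individual elements back out of
# them would give up the gains. Data here is illustrative only.
import pandas as pd
import pyarrow as pa

# Two record batches, as a batch-oriented serializer might produce them.
batch1 = pa.RecordBatch.from_pandas(pd.DataFrame({"eid": [1, 2], "response": [0, 1]}))
batch2 = pa.RecordBatch.from_pandas(pd.DataFrame({"eid": [3, 4], "response": [1, 0]}))

# Assemble the batches into a table and convert to pandas in one shot.
table = pa.Table.from_batches([batch1, batch2])
pdf = table.to_pandas()
print(pdf)
{code}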

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-07-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084369#comment-16084369
 ] 

Marcelo Vanzin commented on SPARK-18085:


[~kanjilal] The code is pretty much all written at this point; most of the help 
I'd need right now is with testing and code reviews.

For those arriving now, all the code is at 
https://github.com/vanzin/spark/commits/shs-ng/HEAD, and you can check the 
sub-tasks for which patches are currently under review (since sending all at 
the same time would create a huge diff).

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-07-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084361#comment-16084361
 ] 

Saikat Kanjilal commented on SPARK-18085:
-

[~vanzin] I would be interested in helping with this. I will first read the 
proposal, go through this thread, and add my feedback. Barring that, are there 
areas where you need immediate help?

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20703) Add an operator for writing data out

2017-07-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084355#comment-16084355
 ] 

Steve Loughran commented on SPARK-20703:


this has just added a whole new stack trace for my collection :)

The thing to consider here is that {{numOutputBytes += 
getFileSize(taskAttemptContext.getConfiguration, currentPath)}} is not free on 
an object store; it's another HTTP HEAD request, which can add 100-200 ms of 
latency and, if the file isn't actually there yet, can fail with an FNFE.

I don't see a more reliable way to get the aggregate stats off a newly written 
file, though. Hadoop 2.8 has added a new way to get stats off a filesystem 
instance, but that doesn't pick them up off an individual write, and while 
output streams do build up their own stats (e.g. 
[S3aBlockOutputStream|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInstrumentation.java#L817]),
 they aren't exposed at the stream level, except in the toString() call, which 
is only there for logging & diagnostics. It would really take an interface for 
asking stats off a stream to get at the interesting facts, but since that would 
need to be transitive over a stream chain, it's unworkable. For now, 
{{getFileStatus()}} is all there is.

What I can do is supply a patch which catches all IOEs raised when asking for 
the file size, downgrades them to a warning, and returns 0 bytes. That would 
still include the potential delay of the round trip to the server, but it would 
at least stop metric collection from breaking jobs due to a transient 
consistency/visibility failure. Interested?
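
For illustration only, a minimal sketch of what such a patch could look like, 
using a hypothetical helper rather than the actual Spark code: the status probe 
is wrapped so that any IOException (including an FNFE from an 
eventually-consistent store) is downgraded to a warning and counted as 0 bytes.

{code:scala}
import java.io.IOException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.slf4j.LoggerFactory

object BestEffortFileSize {
  private val log = LoggerFactory.getLogger(getClass)

  // Best-effort size lookup: the HEAD request behind getFileStatus() may fail
  // transiently on an object store, so metrics fall back to 0 bytes instead of
  // failing the task.
  def getFileSize(conf: Configuration, path: Path): Long =
    try {
      path.getFileSystem(conf).getFileStatus(path).getLen
    } catch {
      case e: IOException =>
        log.warn(s"Could not determine size of $path; recording 0 bytes", e)
        0L
    }
}
{code}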

> Add an operator for writing data out
> 
>
> Key: SPARK-20703
> URL: https://issues.apache.org/jira/browse/SPARK-20703
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add an operator for writing data out. Right now in the explain plan 
> / UI there is no way to tell whether a query is writing data out, and also 
> there is no way to associate metrics with data writes. It'd be tremendously 
> valuable to do this for adding metrics and for visibility.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084333#comment-16084333
 ] 

Kazuaki Ishizaki edited comment on SPARK-21391 at 7/12/17 5:19 PM:
---

This program works with the master or Spark 2.2. Would it be possible to use 
Spark 2.2?

{code}
++
|  properties|
++
|Map(A1 -> [Wrappe...|
|Map(A2 -> [Wrappe...|
++
{code}



was (Author: kiszk):
This program works with the master and Spark 2.2. Would it be possible to use 
Spark 2.2?

{code}
++
|  properties|
++
|Map(A1 -> [Wrappe...|
|Map(A2 -> [Wrappe...|
++
{code}


> Cannot convert a Seq of Map whose value type is again a seq, into a dataset 
> 
>
> Key: SPARK-21391
> URL: https://issues.apache.org/jira/browse/SPARK-21391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Seen on mac OSX, scala 2.11, java 8
>Reporter: indraneel rao
>
> There is an error while trying to create a dataset from a sequence of Maps, 
> whose values have any kind of collections. Even when they are wrapped in a 
> case class. 
> Eg : The following piece of code throws an error:
>
> {code:java}
> case class Values(values: Seq[Double])
> case class ItemProperties(properties:Map[String,Values])
> val l1 = List(ItemProperties(
>   Map(
> "A1" -> Values(Seq(1.0,2.0)),
> "B1" -> Values(Seq(44.0,55.0))
>   )
> ),
>   ItemProperties(
> Map(
>   "A2" -> Values(Seq(123.0,25.0)),
>   "B2" -> Values(Seq(445.0,35.0))
> )
>   )
> )
> l1.toDS().show()
> {code}
> Heres the error:
> 17/07/12 21:59:35 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 65, Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an 
> rvalue
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private boolean resultIsNull;
> /* 009 */   private java.lang.String argValue;
> /* 010 */   private Object[] values;
> /* 011 */   private boolean resultIsNull1;
> /* 012 */   private scala.collection.Seq argValue1;
> /* 013 */   private boolean isNull11;
> /* 014 */   private boolean value11;
> /* 015 */   private boolean isNull12;
> /* 016 */   private InternalRow value12;
> /* 017 */   private boolean isNull13;
> /* 018 */   private InternalRow value13;
> /* 019 */   private UnsafeRow result;
> /* 020 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter;
> /* 023 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter1;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter2;
> /* 026 */
> /* 027 */   public SpecificUnsafeProjection(Object[] references) {
> /* 028 */ this.references = references;
> /* 029 */
> /* 030 */
> /* 031 */ this.values = null;
> /* 032 */
> /* 033 */
> /* 034 */ isNull11 = false;
> /* 035 */ value11 = false;
> /* 036 */ isNull12 = false;
> /* 037 */ value12 = null;
> /* 038 */ isNull13 = false;
> /* 039 */ value13 = null;
> /* 040 */ result = new UnsafeRow(1);
> /* 041 */ this.holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
> /* 042 */ this.rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 043 */ this.arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 044 */ this.arrayWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 045 */ this.rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 046 */ this.arrayWriter2 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 047 */
> /* 048 */   }
> /* 049 */
> /* 050 */   public void initialize(int partitionIndex) {
> /* 051 */
> /* 052 */   }
> /* 053 */
> /* 054 */
> /* 055 

[jira] [Comment Edited] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084333#comment-16084333
 ] 

Kazuaki Ishizaki edited comment on SPARK-21391 at 7/12/17 5:19 PM:
---

This program works with the master and Spark 2.2. Would it be possible to use 
Spark 2.2?

{code}
++
|  properties|
++
|Map(A1 -> [Wrappe...|
|Map(A2 -> [Wrappe...|
++
{code}



was (Author: kiszk):
This program works with the master.

{code}
++
|  properties|
++
|Map(A1 -> [Wrappe...|
|Map(A2 -> [Wrappe...|
++
{code}


> Cannot convert a Seq of Map whose value type is again a seq, into a dataset 
> 
>
> Key: SPARK-21391
> URL: https://issues.apache.org/jira/browse/SPARK-21391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Seen on mac OSX, scala 2.11, java 8
>Reporter: indraneel rao
>
> There is an error while trying to create a dataset from a sequence of Maps, 
> whose values have any kind of collections. Even when they are wrapped in a 
> case class. 
> Eg : The following piece of code throws an error:
>
> {code:java}
> case class Values(values: Seq[Double])
> case class ItemProperties(properties:Map[String,Values])
> val l1 = List(ItemProperties(
>   Map(
> "A1" -> Values(Seq(1.0,2.0)),
> "B1" -> Values(Seq(44.0,55.0))
>   )
> ),
>   ItemProperties(
> Map(
>   "A2" -> Values(Seq(123.0,25.0)),
>   "B2" -> Values(Seq(445.0,35.0))
> )
>   )
> )
> l1.toDS().show()
> {code}
> Heres the error:
> 17/07/12 21:59:35 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 65, Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an 
> rvalue
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private boolean resultIsNull;
> /* 009 */   private java.lang.String argValue;
> /* 010 */   private Object[] values;
> /* 011 */   private boolean resultIsNull1;
> /* 012 */   private scala.collection.Seq argValue1;
> /* 013 */   private boolean isNull11;
> /* 014 */   private boolean value11;
> /* 015 */   private boolean isNull12;
> /* 016 */   private InternalRow value12;
> /* 017 */   private boolean isNull13;
> /* 018 */   private InternalRow value13;
> /* 019 */   private UnsafeRow result;
> /* 020 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter;
> /* 023 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter1;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter2;
> /* 026 */
> /* 027 */   public SpecificUnsafeProjection(Object[] references) {
> /* 028 */ this.references = references;
> /* 029 */
> /* 030 */
> /* 031 */ this.values = null;
> /* 032 */
> /* 033 */
> /* 034 */ isNull11 = false;
> /* 035 */ value11 = false;
> /* 036 */ isNull12 = false;
> /* 037 */ value12 = null;
> /* 038 */ isNull13 = false;
> /* 039 */ value13 = null;
> /* 040 */ result = new UnsafeRow(1);
> /* 041 */ this.holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
> /* 042 */ this.rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 043 */ this.arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 044 */ this.arrayWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 045 */ this.rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 046 */ this.arrayWriter2 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 047 */
> /* 048 */   }
> /* 049 */
> /* 050 */   public void initialize(int partitionIndex) {
> /* 051 */
> /* 052 */   }
> /* 053 */
> /* 054 */
> /* 055 */   private void evalIfTrueExpr(InternalRow i) {
> /* 

[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased

2017-07-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084274#comment-16084274
 ] 

Dongjoon Hyun commented on SPARK-21380:
---

I see. I agree with your point that the warning is misleading here.

> Join with Columns thinks inner join is cross join even when aliased
> ---
>
> Key: SPARK-21380
> URL: https://issues.apache.org/jira/browse/SPARK-21380
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Everett Anderson
>  Labels: correctness
>
> While this seemed to work in Spark 2.0.2, it fails in 2.1.0 and 2.1.1.
> Even after aliasing both the table names and all the columns, joining 
> Datasets using a criteria assembled from Columns rather than the with the 
> join( usingColumns) method variants errors complaining that a join is a 
> cross join / cartesian product even when it isn't.
> Example:
> {noformat}
> Dataset left = spark.sql("select 'bob' as name, 23 as age");
> left = left
> .alias("l")
> .select(
> left.col("name").as("l_name"),
> left.col("age").as("l_age"));
> Dataset right = spark.sql("select 'bob' as name, 'bobco' as 
> company");
> right = right
> .alias("r")
> .select(
> right.col("name").as("r_name"),
> right.col("company").as("r_age"));
> Dataset result = left.join(
> right,
> left.col("l_name").equalTo(right.col("r_name")),
> "inner");
> result.show();
> {noformat}
> Results in
> {noformat}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER 
> join between logical plans
> Project [bob AS l_name#22, 23 AS l_age#23]
> +- OneRowRelation$
> and
> Project [bob AS r_name#33, bobco AS r_age#34]
> +- OneRowRelation$
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1067)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1064)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1064)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1049)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> 

[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-07-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084277#comment-16084277
 ] 

Reynold Xin commented on SPARK-18085:
-

You should email dev@ to notify the list about a new SPIP. Thanks.


> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084270#comment-16084270
 ] 

Saisai Shao commented on SPARK-21376:
-

Hi [~tgraves], it is the local yarn launcher process which launches the Spark 
application on the yarn cluster. The problem here is that the local launcher 
process always keeps the initial token, which never gets renewed; when the 
application is killed, the local launcher process tries to delete the staging 
files, and using this initial token fails in a long-running scenario.
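
To illustrate the point (this is not Spark's actual code path), a hedged sketch 
of one way the launcher side could avoid relying on the expired initial token 
when cleaning up: log in again from the keytab and delete the staging directory 
under the fresh credentials. The {{principal}}, {{keytab}} and {{stagingDir}} 
parameters are hypothetical.

{code:scala}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation

// Re-login from the keytab and delete .sparkStaging/<appId> with fresh credentials,
// instead of reusing the (possibly expired) delegation token held since launch.
def cleanupStagingDir(conf: Configuration, principal: String, keytab: String,
                      stagingDir: Path): Unit = {
  val freshUgi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
  freshUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      val fs = stagingDir.getFileSystem(conf)
      fs.delete(stagingDir, true)   // recursive delete of the staging directory
    }
  })
}
{code}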

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run HDFSWordcount streaming app in yarn-cluster mode  for 25 hrs. 
> After 25 hours, noticing that HDFS Wordcount job is hitting 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084333#comment-16084333
 ] 

Kazuaki Ishizaki commented on SPARK-21391:
--

This program works with the master.

{code}
++
|  properties|
++
|Map(A1 -> [Wrappe...|
|Map(A2 -> [Wrappe...|
++
{code}


> Cannot convert a Seq of Map whose value type is again a seq, into a dataset 
> 
>
> Key: SPARK-21391
> URL: https://issues.apache.org/jira/browse/SPARK-21391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Seen on mac OSX, scala 2.11, java 8
>Reporter: indraneel rao
>
> There is an error while trying to create a dataset from a sequence of Maps, 
> whose values have any kind of collections. Even when they are wrapped in a 
> case class. 
> Eg : The following piece of code throws an error:
>
> {code:java}
> case class Values(values: Seq[Double])
> case class ItemProperties(properties:Map[String,Values])
> val l1 = List(ItemProperties(
>   Map(
> "A1" -> Values(Seq(1.0,2.0)),
> "B1" -> Values(Seq(44.0,55.0))
>   )
> ),
>   ItemProperties(
> Map(
>   "A2" -> Values(Seq(123.0,25.0)),
>   "B2" -> Values(Seq(445.0,35.0))
> )
>   )
> )
> l1.toDS().show()
> {code}
> Heres the error:
> 17/07/12 21:59:35 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 65, Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an 
> rvalue
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private boolean resultIsNull;
> /* 009 */   private java.lang.String argValue;
> /* 010 */   private Object[] values;
> /* 011 */   private boolean resultIsNull1;
> /* 012 */   private scala.collection.Seq argValue1;
> /* 013 */   private boolean isNull11;
> /* 014 */   private boolean value11;
> /* 015 */   private boolean isNull12;
> /* 016 */   private InternalRow value12;
> /* 017 */   private boolean isNull13;
> /* 018 */   private InternalRow value13;
> /* 019 */   private UnsafeRow result;
> /* 020 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter;
> /* 023 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter1;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> arrayWriter2;
> /* 026 */
> /* 027 */   public SpecificUnsafeProjection(Object[] references) {
> /* 028 */ this.references = references;
> /* 029 */
> /* 030 */
> /* 031 */ this.values = null;
> /* 032 */
> /* 033 */
> /* 034 */ isNull11 = false;
> /* 035 */ value11 = false;
> /* 036 */ isNull12 = false;
> /* 037 */ value12 = null;
> /* 038 */ isNull13 = false;
> /* 039 */ value13 = null;
> /* 040 */ result = new UnsafeRow(1);
> /* 041 */ this.holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
> /* 042 */ this.rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 043 */ this.arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 044 */ this.arrayWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 045 */ this.rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
> /* 046 */ this.arrayWriter2 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 047 */
> /* 048 */   }
> /* 049 */
> /* 050 */   public void initialize(int partitionIndex) {
> /* 051 */
> /* 052 */   }
> /* 053 */
> /* 054 */
> /* 055 */   private void evalIfTrueExpr(InternalRow i) {
> /* 056 */ final InternalRow value7 = null;
> /* 057 */ isNull12 = true;
> /* 058 */ value12 = value7;
> /* 059 */   }
> /* 060 */
> /* 061 */
> /* 062 */   private void evalIfCondExpr(InternalRow i) {
> /* 063 */
> /* 064 */ isNull11 = false;
> /* 065 */ value11 = 

[jira] [Commented] (SPARK-21390) Dataset filter api inconsistency

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084306#comment-16084306
 ] 

Kazuaki Ishizaki commented on SPARK-21390:
--

Another interesting result with Spark 2.2:
On IDE
{code:java}
{
...
filterMe1.filter(x=> filterCondition.contains(x)).show
filterMe1.filter(x=> filterCondition.contains(SomeClass(x.field1, 
x.field2))).show
}

+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+

+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+
{code}

On REPL
{code:java}
...
scala> filterMe1.filter(x => filterCondition.contains(x)).show
+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+

scala> filterMe1.filter(x => filterCondition.contains(SomeClass(x.field1, 
x.field2))).show
+--+--+
|field1|field2|
+--+--+
+--+--+

scala> print(filterCondition.contains(SomeClass("00", "01")))
true

scala> filterMe1.filter(x => { val c = 
filterCondition.contains(SomeClass(x.field1, x.field2)); print(s"$c\n"); c} 
).show
false
+--+--+
|field1|field2|
+--+--+
+--+--+
{code}

> Dataset filter api inconsistency
> 
>
> Key: SPARK-21390
> URL: https://issues.apache.org/jira/browse/SPARK-21390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Gheorghe Gheorghe
>Priority: Minor
>
> Hello everybody, 
> I've encountered a strange situation with the spark-shell.
> When I run the code below in my IDE the second test case prints as expected 
> count "1". However, when I run the same code using the spark-shell in the 
> second test case I get 0 back as a count. 
> I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE 
> and spark-shell. 
> {code:java}
>   import org.apache.spark.sql.Dataset
>   case class SomeClass(field1:String, field2:String)
>   val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )
>   // Test 1
>   val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
>   
>   println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
>   
>   // Test 2
>   case class OtherClass(field1:String, field2:String)
>   
>   val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS
>   println("Fail, count should return 1: " + filterMe2.filter(x=> 
> filterCondition.contains(SomeClass(x.field1, x.field2))).count)
> {code}
> Note if I transform the dataset first I get 1 back as expected.
> {code:java}
>  println(filterMe2.map(x=> SomeClass(x.field1, 
> x.field2)).filter(filterCondition.contains(_)).count)
> {code}
> Is this a bug? I can see that this filter function has been marked as 
> experimental 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21390) Dataset filter api inconsistency

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084306#comment-16084306
 ] 

Kazuaki Ishizaki edited comment on SPARK-21390 at 7/12/17 5:09 PM:
---

Another interesting result with Spark 2.2. Does this happen only for case classes defined in the REPL?

On IDE
{code:java}
{
...
filterMe1.filter(x=> filterCondition.contains(x)).show
filterMe1.filter(x=> filterCondition.contains(SomeClass(x.field1, 
x.field2))).show
}

+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+

+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+
{code}

On REPL
{code:java}
...
scala> filterMe1.filter(x => filterCondition.contains(x)).show
+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+

scala> filterMe1.filter(x => filterCondition.contains(SomeClass(x.field1, 
x.field2))).show
+--+--+
|field1|field2|
+--+--+
+--+--+

scala> print(filterCondition.contains(SomeClass("00", "01")))
true

scala> filterMe1.filter(x => { val c = 
filterCondition.contains(SomeClass(x.field1, x.field2)); print(s"$c\n"); c} 
).show
false
+--+--+
|field1|field2|
+--+--+
+--+--+

scala> Seq((0, 0), (1, 1), (2, 2)).toDS.filter(x => { val c = Seq((1, 
1)).contains((x._1, x._2)); print(s"$c\n"); c} ).show
false
true
false
+---+---+
| _1| _2|
+---+---+
|  1|  1|
+---+---+
{code}


was (Author: kiszk):
Another interesting results with Spark-2.2:
On IDE
{code:java}
{
...
filterMe1.filter(x=> filterCondition.contains(x)).show
filterMe1.filter(x=> filterCondition.contains(SomeClass(x.field1, 
x.field2))).show
}

+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+

+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+
{code}

On REPL
{code:java}
...
scala> filterMe1.filter(x => filterCondition.contains(x)).show
+--+--+
|field1|field2|
+--+--+
|00|01|
+--+--+

scala> filterMe1.filter(x => filterCondition.contains(SomeClass(x.field1, 
x.field2))).show
+--+--+
|field1|field2|
+--+--+
+--+--+

scala> print(filterCondition.contains(SomeClass("00", "01")))
true

scala> filterMe1.filter(x => { val c = 
filterCondition.contains(SomeClass(x.field1, x.field2)); print(s"$c\n"); c} 
).show
false
+--+--+
|field1|field2|
+--+--+
+--+--+
{code}

> Dataset filter api inconsistency
> 
>
> Key: SPARK-21390
> URL: https://issues.apache.org/jira/browse/SPARK-21390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Gheorghe Gheorghe
>Priority: Minor
>
> Hello everybody, 
> I've encountered a strange situation with the spark-shell.
> When I run the code below in my IDE the second test case prints as expected 
> count "1". However, when I run the same code using the spark-shell in the 
> second test case I get 0 back as a count. 
> I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE 
> and spark-shell. 
> {code:java}
>   import org.apache.spark.sql.Dataset
>   case class SomeClass(field1:String, field2:String)
>   val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )
>   // Test 1
>   val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
>   
>   println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
>   
>   // Test 2
>   case class OtherClass(field1:String, field2:String)
>   
>   val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS
>   println("Fail, count should return 1: " + filterMe2.filter(x=> 
> filterCondition.contains(SomeClass(x.field1, x.field2))).count)
> {code}
> Note if I transform the dataset first I get 1 back as expected.
> {code:java}
>  println(filterMe2.map(x=> SomeClass(x.field1, 
> x.field2)).filter(filterCondition.contains(_)).count)
> {code}
> Is this a bug? I can see that this filter function has been marked as 
> experimental 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-07-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18085:

Summary: SPIP: Better History Server scalability for many / large 
applications  (was: Better History Server scalability for many / large 
applications)

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-07-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084265#comment-16084265
 ] 

Marcelo Vanzin commented on SPARK-18085:


Sure, if it's just a matter of adding the label to the bug, done. Or is there 
something else to do?

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18085) Better History Server scalability for many / large applications

2017-07-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18085:
---
Labels: SPIP  (was: )

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21390) Dataset filter api inconsistency

2017-07-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084266#comment-16084266
 ] 

Kazuaki Ishizaki commented on SPARK-21390:
--

Thank you for reporting this. I can reproduce this using Spark 2.2, too.

{code:java}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
 
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset

scala> case class SomeClass(field1:String, field2:String)
defined class SomeClass

scala> val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )
filterCondition: Seq[SomeClass] = List(SomeClass(00,01))

scala> val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
filterMe1: org.apache.spark.sql.Dataset[SomeClass] = [field1: string, field2: 
string]

scala> println("Works fine!" 
+filterMe1.filter(filterCondition.contains(_)).count)
Works fine!1

scala> case class OtherClass(field1:String, field2:String)
defined class OtherClass

scala> val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS
filterMe2: org.apache.spark.sql.Dataset[OtherClass] = [field1: string, field2: 
string]

scala> println("Fail, count should return 1: " + filterMe2.filter(x=> 
filterCondition.contains(SomeClass(x.field1, x.field2))).count)
Fail, count should return 1: 0
{code}

> Dataset filter api inconsistency
> 
>
> Key: SPARK-21390
> URL: https://issues.apache.org/jira/browse/SPARK-21390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Gheorghe Gheorghe
>Priority: Minor
>
> Hello everybody, 
> I've encountered a strange situation with the spark-shell.
> When I run the code below in my IDE the second test case prints as expected 
> count "1". However, when I run the same code using the spark-shell in the 
> second test case I get 0 back as a count. 
> I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE 
> and spark-shell. 
> {code:java}
>   import org.apache.spark.sql.Dataset
>   case class SomeClass(field1:String, field2:String)
>   val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )
>   // Test 1
>   val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
>   
>   println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
>   
>   // Test 2
>   case class OtherClass(field1:String, field2:String)
>   
>   val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS
>   println("Fail, count should return 1: " + filterMe2.filter(x=> 
> filterCondition.contains(SomeClass(x.field1, x.field2))).count)
> {code}
> Note if I transform the dataset first I get 1 back as expected.
> {code:java}
>  println(filterMe2.map(x=> SomeClass(x.field1, 
> x.field2)).filter(filterCondition.contains(_)).count)
> {code}
> Is this a bug? I can see that this filter function has been marked as 
> experimental 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21391) Cannot convert a Seq of Map whose value type is again a seq, into a dataset

2017-07-12 Thread indraneel rao (JIRA)
indraneel rao created SPARK-21391:
-

 Summary: Cannot convert a Seq of Map whose value type is again a 
seq, into a dataset 
 Key: SPARK-21391
 URL: https://issues.apache.org/jira/browse/SPARK-21391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
 Environment: Seen on mac OSX, scala 2.11, java 8
Reporter: indraneel rao


There is an error when trying to create a dataset from a sequence of Maps whose 
values contain any kind of collection, even when they are wrapped in a case 
class. 
E.g., the following piece of code throws an error:
   
{code:java}
 case class Values(values: Option[Seq[Double]])
case class ItemProperties(properties:Map[String,Values])

case class A(values :Set[Double])
val values3 = Set(1.0,2.0,3.0)
spark.createDataset(Seq(values3)).show()
val l1 = List(ItemProperties(
  Map(
"A1" -> Values(Some(Seq(1.0,2.0))),
"B1" -> Values(Some(Seq(44.0,55.0)))
  )
),
  ItemProperties(
Map(
  "A2" -> Values(Some(Seq(123.0,25.0))),
  "B2" -> Values(Some(Seq(445.0,35.0)))
)
  )
)

l1.toDS().show()
{code}


Here's the error:

17/07/12 21:49:31 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 65, 
Column 46: Expression "ExternalMapToCatalyst_value_isNull0" is not an rvalue
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private boolean resultIsNull;
/* 009 */   private java.lang.String argValue;
/* 010 */   private Object[] values;
/* 011 */   private boolean resultIsNull1;
/* 012 */   private scala.collection.Seq argValue1;
/* 013 */   private boolean isNull12;
/* 014 */   private boolean value12;
/* 015 */   private boolean isNull13;
/* 016 */   private InternalRow value13;
/* 017 */   private boolean isNull14;
/* 018 */   private InternalRow value14;
/* 019 */   private UnsafeRow result;
/* 020 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
/* 021 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
/* 022 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter arrayWriter;
/* 023 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
arrayWriter1;
/* 024 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter1;
/* 025 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
arrayWriter2;
/* 026 */
/* 027 */   public SpecificUnsafeProjection(Object[] references) {
/* 028 */ this.references = references;
/* 029 */
/* 030 */
/* 031 */ this.values = null;
/* 032 */
/* 033 */
/* 034 */ isNull12 = false;
/* 035 */ value12 = false;
/* 036 */ isNull13 = false;
/* 037 */ value13 = null;
/* 038 */ isNull14 = false;
/* 039 */ value14 = null;
/* 040 */ result = new UnsafeRow(1);
/* 041 */ this.holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result, 32);
/* 042 */ this.rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
/* 043 */ this.arrayWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 044 */ this.arrayWriter1 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 045 */ this.rowWriter1 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 1);
/* 046 */ this.arrayWriter2 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 047 */
/* 048 */   }
/* 049 */
/* 050 */   public void initialize(int partitionIndex) {
/* 051 */
/* 052 */   }
/* 053 */
/* 054 */
/* 055 */   private void evalIfTrueExpr(InternalRow i) {
/* 056 */ final InternalRow value7 = null;
/* 057 */ isNull13 = true;
/* 058 */ value13 = value7;
/* 059 */   }
/* 060 */
/* 061 */
/* 062 */   private void evalIfCondExpr(InternalRow i) {
/* 063 */
/* 064 */ isNull12 = false;
/* 065 */ value12 = ExternalMapToCatalyst_value_isNull0;
/* 066 */   }
/* 067 */
/* 068 */
/* 069 */   private void evalIfFalseExpr(InternalRow i) {
/* 070 */ values = new Object[1];
/* 071 */ resultIsNull1 = false;
/* 072 */ if (!resultIsNull1) {
/* 073 */
/* 074 */   boolean isNull11 = true;
/* 075 */   scala.Option value11 = null;
/* 076 */   if (!ExternalMapToCatalyst_value_isNull0) {
/* 077 */
/* 078 */ isNull11 = false;
/* 079 */ if (!isNull11) {
/* 080 */
/* 081 */   Object funcResult1 = 

[jira] [Updated] (SPARK-21390) Dataset filter api inconsistency

2017-07-12 Thread Gheorghe Gheorghe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gheorghe Gheorghe updated SPARK-21390:
--
Description: 
Hello everybody, 

I've encountered a strange situation with the spark-shell.
When I run the code below in my IDE the second test case prints as expected 
count "1". However, when I run the same code using the spark-shell in the 
second test case I get 0 back as a count. 
I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE and 
spark-shell. 


{code:java}
  import org.apache.spark.sql.Dataset

  case class SomeClass(field1:String, field2:String)

  val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )

  // Test 1
  val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
  
  println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
  
  // Test 2
  case class OtherClass(field1:String, field2:String)
  
  val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS

  println("Fail, count should return 1: " + filterMe2.filter(x=> 
filterCondition.contains(SomeClass(x.field1, x.field2))).count)
{code}

Note if I transform the dataset first I get 1 back as expected.
{code:java}
 println(filterMe2.map(x=> SomeClass(x.field1, 
x.field2)).filter(filterCondition.contains(_)).count)
{code}

Is this a bug? I can see that this filter function has been marked as 
experimental 
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)

  was:
Hello everybody, 

I've encountered a strange situation with the spark-shell.
When I run the code below in my IDE the second test case prints as expected 
count "1". However, when I run the same code using the spark-shell in the 
second test case I get 0 back as a count. 
I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE and 
spark-shell. 


{code:java}
  import org.apache.spark.sql.Dataset

  case class SomeClass(field1:String, field2:String)

  val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )

  // Test 1
  val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
  
  println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
  
  // Test 2
  case class OtherClass(field1:String, field2:String)
  
  val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS

  println("Fail, count should return 1: " + filterMe2.filter(x=> 
filterCondition.contains(SomeClass(x.field1, x.field2))).count)
{code}

Note if I do this it is printing 1 as expected.
{code:java}
 println(filterMe2.map(x=> SomeClass(x.field1, 
x.field2)).filter(filterCondition.contains(_)).count)
{code}

Is this a bug? I can see that this filter function has been marked as 
experimental 
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)


> Dataset filter api inconsistency
> 
>
> Key: SPARK-21390
> URL: https://issues.apache.org/jira/browse/SPARK-21390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Gheorghe Gheorghe
>Priority: Minor
>
> Hello everybody, 
> I've encountered a strange situation with the spark-shell.
> When I run the code below in my IDE the second test case prints as expected 
> count "1". However, when I run the same code using the spark-shell in the 
> second test case I get 0 back as a count. 
> I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE 
> and spark-shell. 
> {code:java}
>   import org.apache.spark.sql.Dataset
>   case class SomeClass(field1:String, field2:String)
>   val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )
>   // Test 1
>   val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
>   
>   println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
>   
>   // Test 2
>   case class OtherClass(field1:String, field2:String)
>   
>   val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS
>   println("Fail, count should return 1: " + filterMe2.filter(x=> 
> filterCondition.contains(SomeClass(x.field1, x.field2))).count)
> {code}
> Note if I transform the dataset first I get 1 back as expected.
> {code:java}
>  println(filterMe2.map(x=> SomeClass(x.field1, 
> x.field2)).filter(filterCondition.contains(_)).count)
> {code}
> Is this a bug? I can see that this filter function has been marked as 
> experimental 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Updated] (SPARK-21390) Dataset filter api inconsistency

2017-07-12 Thread Gheorghe Gheorghe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gheorghe Gheorghe updated SPARK-21390:
--
Description: 
Hello everybody, 

I've encountered a strange situation with the spark-shell.
When I run the code below in my IDE the second test case prints as expected 
count "1". However, when I run the same code using the spark-shell in the 
second test case I get 0 back as a count. 
I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE and 
spark-shell. 


{code:java}
  import org.apache.spark.sql.Dataset

  case class SomeClass(field1:String, field2:String)

  val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )

  // Test 1
  val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
  
  println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
  
  // Test 2
  case class OtherClass(field1:String, field2:String)
  
  val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS

  println("Fail, count should return 1: " + filterMe2.filter(x=> 
filterCondition.contains(SomeClass(x.field1, x.field2))).count)
{code}

Note if I do this it is printing 1 as expected.
{code:java}
 println(filterMe2.map(x=> SomeClass(x.field1, 
x.field2)).filter(filterCondition.contains(_)).count)
{code}

Is this a bug? I can see that this filter function has been marked as 
experimental 
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)

  was:
Hello everybody, 

I've encountered a strange situation with spark 2.0.1 in spark-shell. 
When I run the code below in my IDE I get the in the second test case as 
expected 1. However, when I run the spark shell with the same code the second 
test case is returning 0. 
I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE and 
spark-shell. 


{code:java}
  import org.apache.spark.sql.Dataset

  case class SomeClass(field1:String, field2:String)

  val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )

  // Test 1
  val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
  
  println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
  
  // Test 2
  case class OtherClass(field1:String, field2:String)
  
  val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS

  println("Fail, count should return 1: " + filterMe2.filter(x=> 
filterCondition.contains(SomeClass(x.field1, x.field2))).count)
{code}

Note if I do this it is printing 1 as expected.
{code:java}
 println(filterMe2.map(x=> SomeClass(x.field1, 
x.field2)).filter(filterCondition.contains(_)).count)
{code}

Is this a bug? I can see that this filter function has been marked as 
experimental 
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)


> Dataset filter api inconsistency
> 
>
> Key: SPARK-21390
> URL: https://issues.apache.org/jira/browse/SPARK-21390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Gheorghe Gheorghe
>Priority: Minor
>
> Hello everybody, 
> I've encountered a strange situation with the spark-shell.
> When I run the code below in my IDE the second test case prints as expected 
> count "1". However, when I run the same code using the spark-shell in the 
> second test case I get 0 back as a count. 
> I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE 
> and spark-shell. 
> {code:java}
>   import org.apache.spark.sql.Dataset
>   case class SomeClass(field1:String, field2:String)
>   val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )
>   // Test 1
>   val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
>   
>   println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
>   
>   // Test 2
>   case class OtherClass(field1:String, field2:String)
>   
>   val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS
>   println("Fail, count should return 1: " + filterMe2.filter(x=> 
> filterCondition.contains(SomeClass(x.field1, x.field2))).count)
> {code}
> Note if I do this it is printing 1 as expected.
> {code:java}
>  println(filterMe2.map(x=> SomeClass(x.field1, 
> x.field2)).filter(filterCondition.contains(_)).count)
> {code}
> Is this a bug? I can see that this filter function has been marked as 
> experimental 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21390) Dataset filter api inconsistency

2017-07-12 Thread Gheorghe Gheorghe (JIRA)
Gheorghe Gheorghe created SPARK-21390:
-

 Summary: Dataset filter api inconsistency
 Key: SPARK-21390
 URL: https://issues.apache.org/jira/browse/SPARK-21390
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Gheorghe Gheorghe
Priority: Minor


Hello everybody, 

I've encountered a strange situation with spark 2.0.1 in the spark-shell. 
When I run the code below in my IDE, the second test case returns 1 as 
expected. However, when I run the same code in the spark-shell, the second test 
case returns 0. 
I've made sure that I'm running scala 2.11.8 and spark 2.0.1 in both my IDE and 
spark-shell. 


{code:java}
  import org.apache.spark.sql.Dataset

  case class SomeClass(field1:String, field2:String)

  val filterCondition: Seq[SomeClass] = Seq( SomeClass("00", "01") )

  // Test 1
  val filterMe1: Dataset[SomeClass] = Seq( SomeClass("00", "01") ).toDS
  
  println("Works fine!" +filterMe1.filter(filterCondition.contains(_)).count)
  
  // Test 2
  case class OtherClass(field1:String, field2:String)
  
  val filterMe2 = Seq( OtherClass("00", "01"), OtherClass("00", "02")).toDS

  println("Fail, count should return 1: " + filterMe2.filter(x=> 
filterCondition.contains(SomeClass(x.field1, x.field2))).count)
{code}

Note that if I do this, it prints 1 as expected.
{code:java}
 println(filterMe2.map(x=> SomeClass(x.field1, 
x.field2)).filter(filterCondition.contains(_)).count)
{code}

Is this a bug? I can see that this filter function has been marked as 
experimental 
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#filter(scala.Function1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18646) ExecutorClassLoader for spark-shell does not honor spark.executor.userClassPathFirst

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084193#comment-16084193
 ] 

Apache Spark commented on SPARK-18646:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18614

> ExecutorClassLoader for spark-shell does not honor 
> spark.executor.userClassPathFirst
> 
>
> Key: SPARK-18646
> URL: https://issues.apache.org/jira/browse/SPARK-18646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.2
>Reporter: Min Shen
>
> When submitting a spark-shell application, the executor side classloader is 
> set to be {{ExecutorClassLoader}}.
> However, it appears that when {{ExecutorClassLoader}} is used, parameter 
> {{spark.executor.userClassPathFirst}} is not honored.
> It turns out that, since {{ExecutorClassLoader}} class is defined as
> {noformat}
> class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: 
> ClassLoader,
> userClassPathFirst: Boolean) extends ClassLoader with Logging
> {noformat}
> its parent classloader is actually the system default classloader (due to 
> {{ClassLoader}} class's default constructor) rather than the "parent" 
> classloader specified in {{ExecutorClassLoader}}'s constructor.
> As a result, when {{spark.executor.userClassPathFirst}} is set to true, even 
> though the "parent" classloader is {{ChildFirstURLClassLoader}}, 
> {{ExecutorClassLoader.getParent()}} will return the system default 
> classloader.
> Thus, when {{ExecutorClassLoader}} tries to load a class, it will first 
> attempt to load it through the system default classloader, and this will 
> break the {{spark.executor.userClassPathFirst}} behavior.
> A simple fix would be to define {{ExecutorClassLoader}} as:
> {noformat}
> class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: 
> ClassLoader,
> userClassPathFirst: Boolean) extends ClassLoader(parent) with Logging
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-07-12 Thread Arun Achuthan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084169#comment-16084169
 ] 

Arun Achuthan commented on SPARK-18838:
---

We are facing an issue where, at random, some jobs are stuck either while 
processing or while queued up, although subsequent jobs are processed. For every 
job that gets stuck we see an error at the same time in the driver logs 
mentioning
org.apache.spark.scheduler.LiveListenerBus: Dropped 1 SparkListenerEvents since 
Thu Jan 01 00:00:00 UTC 1970

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt the job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` manager to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listener is very slow, all the listeners will pay the 
> price of delay incurred by the slow listener. In addition to that a slow 
> listener can cause events to be dropped from the event queue which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for each executor service will be bounded to 
> guarantee that memory does not grow indefinitely. The downside of this 
> approach is that a separate event queue per listener increases the driver 
> memory footprint. 
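
A rough, hypothetical sketch of the proposed structure (names invented, not the actual patch): each listener gets its own single-threaded, bounded executor, so events stay in order per listener and a slow listener only delays itself:
{code:java}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

trait Event
trait Listener { def onEvent(event: Event): Unit }

class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 10000) {
  // One single-threaded executor per listener: in-order delivery per listener,
  // bounded queue so memory cannot grow indefinitely, overflow events dropped.
  private val executors: Map[Listener, ThreadPoolExecutor] = listeners.map { l =>
    l -> new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.DiscardPolicy())
  }.toMap

  def post(event: Event): Unit = executors.foreach { case (listener, exec) =>
    exec.execute(new Runnable { override def run(): Unit = listener.onEvent(event) })
  }
}
{code}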



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS

2017-07-12 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21389:
-

 Summary: ALS recommendForAll optimization uses Native BLAS
 Key: SPARK-21389
 URL: https://issues.apache.org/jira/browse/SPARK-21389
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng


In Spark 2.2, we optimized ALS recommendForAll with a hand-written matrix 
multiplication that selects the top-K items for each user. The method 
effectively reduces the GC problem. However, with native BLAS GEMM (like Intel 
MKL and OpenBLAS), matrix multiplication itself is about 10X faster than the 
hand-written method. 

I have rewritten the recommendForAll code with GEMM and got about 20%-30% 
improvement over the current master recommendForAll method. 

Will clean up the code and submit it for discussion.
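
To make the idea concrete, a rough, hypothetical sketch (not the actual patch) using netlib-java's BLAS: score one block of users against one block of items with a single {{dgemm}} call and keep only the top-k items per user:
{code:java}
import com.github.fommil.netlib.BLAS
import scala.collection.mutable

// Hypothetical block-wise scoring: userBlock/itemBlock rows are factor vectors
// of length `rank`; returns, per user, the k (itemIndex, score) pairs with the
// highest scores, best first.
def recommendBlock(userBlock: Array[Array[Double]],
                   itemBlock: Array[Array[Double]],
                   rank: Int,
                   k: Int): Array[Array[(Int, Double)]] = {
  val m = userBlock.length                       // users in this block
  val n = itemBlock.length                       // items in this block
  val u = userBlock.flatten                      // rank x m, column-major
  val v = itemBlock.flatten                      // rank x n, column-major
  val scores = new Array[Double](m * n)          // m x n, column-major
  // scores = u^T * v in a single native GEMM call
  BLAS.getInstance().dgemm("T", "N", m, n, rank, 1.0, u, rank, v, rank, 0.0, scores, m)
  Array.tabulate(m) { ui =>
    // min-heap on score: head is the weakest of the current top-k
    val heap = mutable.PriorityQueue.empty[(Int, Double)](
      Ordering.by((p: (Int, Double)) => -p._2))
    var i = 0
    while (i < n) {
      val s = scores(ui + i * m)
      if (heap.size < k) heap.enqueue((i, s))
      else if (s > heap.head._2) { heap.dequeue(); heap.enqueue((i, s)) }
      i += 1
    }
    heap.dequeueAll.reverse.toArray              // highest score first
  }
}
{code}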



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21376) Token is not renewed in yarn client process in cluster mode

2017-07-12 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084075#comment-16084075
 ] 

Thomas Graves commented on SPARK-21376:
---

Can you please clarify the title and description? What do you mean by "in yarn 
client process in cluster mode"? I assume you were running in yarn cluster 
mode, but what is the yarn client process? The application master?

> Token is not renewed in yarn client process in cluster mode
> ---
>
> Key: SPARK-21376
> URL: https://issues.apache.org/jira/browse/SPARK-21376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Yesha Vora
>Priority: Minor
>
> STR:
> * Set below config in spark-default.conf
> {code}
> spark.yarn.security.credentials.hbase.enabled true
> spark.hbase.connector.security.credentials.enabled false{code}
> * Set below config in hdfs-site.xml
> {code}
> 'dfs.namenode.delegation.token.max-lifetime':'4320'
> 'dfs.namenode.delegation.token.renew-interval':'2880' {code}
> * Run the HDFSWordCount streaming app in yarn-cluster mode for 25 hrs. 
> After 25 hours, we notice that the HDFSWordCount job hits an 
> HDFS_DELEGATION_TOKEN renewal issue. 
> {code}
> 17/06/28 10:49:47 WARN Client: Exception encountered while connecting to the 
> server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> 17/06/28 10:49:47 WARN Client: Failed to cleanup staging dir 
> hdfs://mycluster0/user/hrt_qa/.sparkStaging/application_1498539861056_0015
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 230 for hrt_qa) is expired
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
> at org.apache.hadoop.ipc.Client.call(Client.java:1498){code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-07-12 Thread Joseph Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084069#comment-16084069
 ] 

Joseph Wang commented on SPARK-20307:
-

Fantastic function to add. It would be nice to generalize this to PySpark too. 
Are there any plans for that?

Joseph


[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084058#comment-16084058
 ] 

Apache Spark commented on SPARK-20307:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18613

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> 
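
On the Scala side the knob already exists; a hedged sketch of what the SparkR (and PySpark) wrappers would need to forward, re-using the column name from the R snippet above ({{traindf}}/{{testdf}} here stand for equivalent Scala DataFrames and are assumptions):
{code:java}
import org.apache.spark.ml.feature.StringIndexer

// handleInvalid = "skip" drops rows with unseen labels instead of failing the
// job; Spark 2.2+ also supports "keep", which maps them to an extra index.
val indexer = new StringIndexer()
  .setInputCol("someString")
  .setOutputCol("someStringIndex")
  .setHandleInvalid("skip")

val model   = indexer.fit(traindf)      // traindf: assumed training DataFrame
val indexed = model.transform(testdf)   // unseen labels no longer throw
{code}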

[jira] [Resolved] (SPARK-18619) Make QuantileDiscretizer/Bucketizer/StringIndexer inherit from HasHandleInvalid

2017-07-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18619.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.3.0

> Make QuantileDiscretizer/Bucketizer/StringIndexer inherit from 
> HasHandleInvalid
> ---
>
> Key: SPARK-18619
> URL: https://issues.apache.org/jira/browse/SPARK-18619
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.3.0
>
>
> {{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}} have the same 
> param {{handleInvalid}}, but with different supported options and docs. After 
> SPARK-16151 resolved, we can make all of them inherit from 
> {{HasHandleInvalid}}, and with different supported options and docs for each 
> subclass.
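
A hypothetical sketch of that pattern (names invented, not the actual Spark source): declare the param once in a shared trait and let each subclass override the doc string and the allowed values:
{code:java}
import org.apache.spark.ml.param.{Param, Params, ParamValidators}

// Shared trait: a single declaration of handleInvalid.
trait HasHandleInvalidSketch extends Params {
  val handleInvalid: Param[String] =
    new Param[String](this, "handleInvalid", "how to handle invalid entries")
  final def getHandleInvalid: String = $(handleInvalid)
}

// A subclass narrows the doc and the supported options to its own semantics.
trait StringIndexerLikeParams extends HasHandleInvalidSketch {
  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    "how to handle unseen labels: 'skip', 'error' or 'keep'",
    ParamValidators.inArray(Array("skip", "error", "keep")))
}
{code}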



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21388) GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold

2017-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083974#comment-16083974
 ] 

Apache Spark commented on SPARK-21388:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/18612

> GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold
> 
>
> Key: SPARK-21388
> URL: https://issues.apache.org/jira/browse/SPARK-21388
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>
> make GBTs inherit from {{HasStepSize}} and LinearSVC/Binarizer inherit from 
> {{HasThreshold}}
> The desc for param {{StepSize}} of GBTs in Pydoc is wrong, so I also override 
> it on the Python side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21388) GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21388:


Assignee: Apache Spark

> GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold
> 
>
> Key: SPARK-21388
> URL: https://issues.apache.org/jira/browse/SPARK-21388
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> make GBTs inherit from {{HasStepSize}} and LinearSVC/Binarizer inherit from 
> {{HasThreshold}}
> The desc for param {{StepSize}} of GBTs in Pydoc is wrong, so I also override 
> it on the Python side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21388) GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold

2017-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21388:


Assignee: (was: Apache Spark)

> GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold
> 
>
> Key: SPARK-21388
> URL: https://issues.apache.org/jira/browse/SPARK-21388
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>
> make GBTs inherit from {{HasStepSize}} and LinearSVC/Binarizer inherit from 
> {{HasThreshold}}
> The desc for param {{StepSize}} of GBTs in Pydoc is wrong, so I also override 
> it on the Python side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21388) GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold

2017-07-12 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-21388:


 Summary: GBT inherit from HasStepSize & LInearSVC/Binarizer from 
HasThreshold
 Key: SPARK-21388
 URL: https://issues.apache.org/jira/browse/SPARK-21388
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.1
Reporter: zhengruifeng


make GBTs inherit from {{HasStepSize}} and LinearSVC/Binarizer inherit from 
{{HasThreshold}}

The desc for param {{StepSize}} of GBTs in Pydoc is wrong, so I also override 
it on the Python side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled

2017-07-12 Thread Andrey Taptunov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Taptunov updated SPARK-21374:

Description: 
*Motivation:*
In my case I want to disable filesystem cache to be able to change S3's access 
key and secret key on the fly to read from buckets with different permissions. 
This works perfectly fine for RDDs but doesn't work for DFs.

*Example (works for RDD but fails for DataFrame):*

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {

val awsAccessKeyId = "something"
val awsSecretKey = "something else"

val conf = new SparkConf().setAppName("Simple 
Application").setMaster("local[*]")

val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache",true)

sc.hadoopConfiguration.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.buffer.dir","/tmp")

val spark = SparkSession.builder().config(conf).getOrCreate()

val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
val rddGlob = sc.textFile("s3://bucket/*").count // ok
val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // 
ok

val dfGlob = spark.read.format("csv").load("s3://bucket/*").count 
// IllegalArgumentException. AWS Access Key ID and Secret Access Key must be 
specified as the username or password (respectively)
// of a s3 URL, or by setting the fs.s3.awsAccessKeyId or 
fs.s3.awsSecretAccessKey properties (respectively).
   
sc.stop()
  }
}

{code}

  was:
*Motivation:*
Filesystem configuration is not part of cache's key which is used to find 
instance of FileSystem in filesystem cache, where they are stored by default. 
In my case I have to disable filesystem cache to be able to change access key 
and secret key on the fly to read from buckets with different permissions. 

*Example (works for RDD but fails for DataFrame):*

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {

val awsAccessKeyId = "something"
val awsSecretKey = "something else"

val conf = new SparkConf().setAppName("Simple 
Application").setMaster("local[*]")

val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache",true)

sc.hadoopConfiguration.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.buffer.dir","/tmp")

val spark = SparkSession.builder().config(conf).getOrCreate()

val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
val rddGlob = sc.textFile("s3://bucket/*").count // ok
val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // 
ok

val dfGlob = spark.read.format("csv").load("s3://bucket/*").count 
// IllegalArgumentException. AWS Access Key ID and Secret Access Key must be 
specified as the username or password (respectively)
// of a s3 URL, or by setting the fs.s3.awsAccessKeyId or 
fs.s3.awsSecretAccessKey properties (respectively).
   
sc.stop()
  }
}

{code}

*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which 
uses a "conf" object to create an instance of FileSystem. If caching is enabled 
(the default behavior), the FileSystem instance is retrieved from the cache, so 
the "conf" object is effectively ignored; however, if caching is disabled (my 
case), this "conf" object is used to create the FileSystem instance without the 
settings set by the user.
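
If that analysis is right, one possible (untested) workaround sketch is to route the S3 settings through {{spark.hadoop.*}} keys on {{SparkConf}}, so that every Hadoop {{Configuration}} Spark builds internally carries them, instead of mutating {{sc.hadoopConfiguration}} after the fact:
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
  // spark.hadoop.* entries are copied into the Hadoop Configurations Spark creates
  .set("spark.hadoop.fs.s3.awsAccessKeyId", "something")
  .set("spark.hadoop.fs.s3.awsSecretAccessKey", "something else")
  .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .set("spark.hadoop.fs.s3.impl.disable.cache", "true")
  .set("spark.hadoop.fs.s3.buffer.dir", "/tmp")

val spark = SparkSession.builder().config(conf).getOrCreate()
val dfGlob = spark.read.format("csv").load("s3://bucket/*").count()
spark.stop()
{code}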


> Reading globbed paths from S3 into DF doesn't work if filesystem caching is 
> disabled
> 
>
> Key: SPARK-21374
> URL: https://issues.apache.org/jira/browse/SPARK-21374
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Andrey Taptunov
>
> *Motivation:*
> In my case I want to disable filesystem cache to be able to change S3's 
> access key and secret key on the fly to read from buckets with different 
> permissions. This works perfectly fine for RDDs but doesn't work for DFs.
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> object SimpleApp {
>   def main(args: 

[jira] [Resolved] (SPARK-21007) Add SQL function - RIGHT && LEFT

2017-07-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21007.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18228
[https://github.com/apache/spark/pull/18228]

> Add  SQL function - RIGHT && LEFT
> -
>
> Key: SPARK-21007
> URL: https://issues.apache.org/jira/browse/SPARK-21007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
> Fix For: 2.3.0
>
>
>  Add  SQL function - RIGHT && LEFT, same as MySQL:
>  https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left
> https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right
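
For reference, a small usage sketch (assuming an active {{SparkSession}} named {{spark}} on a build that includes this change):
{code:java}
// left(str, len) returns the leftmost len characters, right(str, len) the rightmost.
spark.sql("SELECT left('Spark SQL', 5) AS l, right('Spark SQL', 3) AS r").show()
// +-----+---+
// |    l|  r|
// +-----+---+
// |Spark|SQL|
// +-----+---+
{code}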



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21387) org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM

2017-07-12 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-21387:


 Summary: org.apache.spark.memory.TaskMemoryManager.allocatePage 
causes OOM
 Key: SPARK-21387
 URL: https://issues.apache.org/jira/browse/SPARK-21387
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21007) Add SQL function - RIGHT && LEFT

2017-07-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21007:
---

Assignee: liuxian

> Add  SQL function - RIGHT && LEFT
> -
>
> Key: SPARK-21007
> URL: https://issues.apache.org/jira/browse/SPARK-21007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
>Assignee: liuxian
> Fix For: 2.3.0
>
>
>  Add  SQL function - RIGHT && LEFT, same as MySQL:
>  https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left
> https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21078) JobHistory applications synchronized is invalid

2017-07-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21078.
---
Resolution: Duplicate

I'm resolving this as 'duplicate' but really this will just go away when the 
linked change finishes. If there's no evidence it's causing a problem in 
practice then I think we can wait.

https://github.com/apache/spark/pull/18497#issuecomment-313815621

> JobHistory applications synchronized is invalid
> ---
>
> Key: SPARK-21078
> URL: https://issues.apache.org/jira/browse/SPARK-21078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: fangfengbin
>
> FsHistoryProvider has several threads that call mergeApplicationListing.
> The mergeApplicationListing function has code like this:
> applications.synchronized { 
> ...
> applications = mergedApps 
> ...
> } 
> The applications object is re-assigned inside the block, so the 
> synchronization does not work.
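
A minimal, hypothetical illustration of that pattern (not the FsHistoryProvider source): once the field being locked on is re-assigned inside the block, the next caller synchronizes on a different object, so two merges can interleave; locking on a stable object avoids that:
{code:java}
class Listing {
  @volatile private var applications: Map[String, Int] = Map.empty

  // Broken: the monitor is whatever object `applications` currently points to;
  // after the re-assignment a concurrent caller locks a different map instance.
  def mergeBroken(app: (String, Int)): Unit = applications.synchronized {
    applications = applications + app
  }

  // Fixed: always lock the same, never re-assigned object.
  private val lock = new Object
  def mergeFixed(app: (String, Int)): Unit = lock.synchronized {
    applications = applications + app
  }
}
{code}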



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21305:
-

 Assignee: Peng Meng
Flags:   (was: Important)
Affects Version/s: (was: 2.3.0)
   2.2.0
 Priority: Minor  (was: Critical)
  Component/s: (was: MLlib)
   Documentation
   Issue Type: Improvement  (was: Umbrella)

> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Peng Meng
>Assignee: Peng Meng
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLlib algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to 
> improve performance. 
> How native BLAS is used matters a lot for performance; quite often native 
> BLAS even causes worse performance. 
> Take, for example, the ALS recommendForAll method before Spark 2.2, which used 
> BLAS gemm for matrix multiplication. 
> If you only benchmark the matrix multiplication itself, native BLAS gemm 
> (like Intel MKL and OpenBLAS) is about 10X faster than netlib-java F2j BLAS 
> gemm. But if you test the end-to-end Spark job performance, F2j is much 
> faster than native BLAS, which is very interesting. 
> I spent a lot of time on this problem and found that we should not use native 
> BLAS (like OpenBLAS and Intel MKL), which supports multi-threading, without 
> any settings. By default, native BLAS enables multi-threading, which 
> conflicts with Spark executors. You can still use multi-threaded native BLAS, 
> but it is better to disable multi-threading first. 
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments about this to docs/ml-guide.md first. 
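
As a hedged example of the kind of guidance that could go into the docs: native BLAS threading can be pinned to one thread per executor through {{spark.executorEnv.*}} (OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are the knobs OpenBLAS and MKL read), so it does not oversubscribe the cores Spark's tasks are already using:
{code:java}
import org.apache.spark.SparkConf

// These environment variables are read by OpenBLAS and Intel MKL respectively;
// spark.executorEnv.* forwards them to every executor process.
val conf = new SparkConf()
  .set("spark.executorEnv.OPENBLAS_NUM_THREADS", "1")
  .set("spark.executorEnv.MKL_NUM_THREADS", "1")
{code}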



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


