[jira] [Closed] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abbi McClintic closed SPARK-21938.
--
Resolution: Cannot Reproduce

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug in which only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Commented] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156517#comment-16156517
 ] 

Abbi McClintic commented on SPARK-21938:


Sure, that makes sense. I'll send out a post to the mailing list and we can 
close this out for now.

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug in which only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Resolved] (SPARK-21912) ORC/Parquet table should not create invalid column names

2017-09-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21912.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> ORC/Parquet table should not create invalid column names
> 
>
> Key: SPARK-21912
> URL: https://issues.apache.org/jira/browse/SPARK-21912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Currently, users hit job abortions when creating ORC data source tables 
> with invalid column names. We had better prevent this by raising an 
> AnalysisException, as Parquet data source tables do.
> {code}
> scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
> 17/09/04 13:28:21 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: Error: : expected at the position 8 of 
> 'struct<a b:int>' but ' ' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
> ...
> 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete 
> file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0
> 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
> 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.SparkException: Task failed while writing rows.
> {code}
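
A hedged spark-shell sketch (not from the ticket) of the behavior the fix aims 
for, plus the obvious workaround; the table name reuses the one from the 
description above:

{code}
scala> // With the fix, this should fail fast with an AnalysisException
scala> // instead of aborting tasks mid-write:
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")

scala> // Workaround: alias the column to a name Hive's type parser accepts.
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 AS a_b")
scala> sql("SELECT a_b FROM orc1").show()
{code}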






[jira] [Commented] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156455#comment-16156455
 ] 

Hyukjin Kwon commented on SPARK-21938:
--

Without reproducible steps or a way to narrow it down, it sounds almost 
impossible to reproduce and verify this issue; we would be unable to tell 
whether it is an actual bug or has already been fixed. I think we had better 
start on the mailing list, narrow it down, and investigate further before 
filing this as an issue.

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug in which only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Commented] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id

2017-09-06 Thread Yann Byron (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156357#comment-16156357
 ] 

Yann Byron commented on SPARK-21858:


I know that.
Because a large number of queries need to be migrated from Hive to Spark SQL, 
I'm afraid I have to keep the query results the same between the two.
So I'll modify the `grouping_id()` behavior in my local version.

Thank you again.

> Make Spark grouping_id() compatible with Hive grouping__id
> --
>
> Key: SPARK-21858
> URL: https://issues.apache.org/jira/browse/SPARK-21858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yann Byron
>
> If you want to migrate ETLs that use `grouping__id` in Hive to Spark and use 
> Spark's `grouping_id()` in place of Hive's `grouping__id`, you will find that 
> their evaluations differ.
> Here is an example.
> {code:java}
> select A, B, grouping__id/grouping_id() from t group by A, B grouping 
> sets((), (A), (B), (A,B))
> {code}
> Running it on Hive and Spark separately, you'll find the following. (An 
> attribute present in the selected grouping set is marked (/), and an absent 
> one is marked (x).)
> ||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B A||
> |(x) (x)|11|3|0|00|(x) (x)|
> |(x) (/)|10|2|2|10|(/) (x)|
> |(/) (x)|01|1|1|01|(x) (/)|
> |(/) (/)|00|0|3|11|(/) (/)|
> As shown above, Hive sets a bit to 0 for (/) and to 1 for (x), while Spark 
> does the opposite.
> Moreover, Hive first reverses the order of the `group by` attributes, whereas 
> Spark evaluates them in the given order.
> In my opinion, the behavior of `grouping_id()` should be modified to make 
> it compatible with Hive `grouping__id`.
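
A hedged illustration (not part of the ticket): given the table above, a 
Hive-compatible value can be derived from Spark's `grouping_id()` for n 
grouping columns by flipping the bits and reversing their order. The helper 
below is hypothetical, written only to make the mapping concrete.

{code}
// Hypothetical converter (not in Spark): Spark sets a bit to 1 when the
// column is absent from the grouping set; the legacy Hive grouping__id sets
// it to 1 when the column is present, with the bit order reversed.
def hiveGroupingId(sparkGid: Long, n: Int): Long = {
  val flipped = ~sparkGid & ((1L << n) - 1)      // flip present/absent bits
  (0 until n).foldLeft(0L) { (acc, i) =>         // reverse the bit order
    (acc << 1) | ((flipped >> i) & 1L)
  }
}
// Matches the table above for n = 2:
// hiveGroupingId(3, 2) == 0; hiveGroupingId(2, 2) == 2
// hiveGroupingId(1, 2) == 1; hiveGroupingId(0, 2) == 3
{code}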






[jira] [Comment Edited] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2017-09-06 Thread ivorzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156341#comment-16156341
 ] 

ivorzhou edited comment on SPARK-19609 at 9/7/17 2:31 AM:
--

It is a very important issue; is there any plan to add this feature?


was (Author: ivorzhou):
It is very important issue, is there any plan to add this feature?

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.
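
A hedged sketch (not from the ticket) of what the proposed optimization would 
do, written manually; the names `small`, `large`, and column `a` are 
hypothetical, and the distinct join keys are assumed to fit in driver memory:

{code}
import org.apache.spark.sql.functions.{broadcast, col}

// Collect the small side's join keys and push them down as an In filter,
// then do the broadcast join as usual.
val keys = small.select("a").distinct().collect().map(_.get(0))
val pruned = large.filter(col("a").isin(keys: _*))  // datasource can prune on this
val joined = pruned.join(broadcast(small), "a")
{code}

The ticket proposes that Spark derive such a {{Filter}} automatically from the 
materialized build side.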






[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2017-09-06 Thread ivorzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156341#comment-16156341
 ] 

ivorzhou commented on SPARK-19609:
--

It is a very important issue; is there any plan to add this feature?

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.






[jira] [Commented] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156336#comment-16156336
 ] 

Apache Spark commented on SPARK-21915:
--

User 'marktab' has created a pull request for this issue:
https://github.com/apache/spark/pull/19152

> Model 1 and Model 2 ParamMaps Missing
> -
>
> Key: SPARK-21915
> URL: https://issues.apache.org/jira/browse/SPARK-21915
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Mark Tabladillo
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Error in PySpark example code
> [https://github.com/apache/spark/blob/master/examples/src/main/python/ml/estimator_transformer_param_example.py]
> The original Scala code says
> println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)
> The parent is lr
> There is no method for accessing parent as is done in Scala.
> 
> This code has been tested in Python, and returns values consistent with Scala
> Proposing to call the lr variable instead of model1 or model2
> 
> This patch was tested with Spark 2.1.0 comparing the Scala and PySpark 
> results. Pyspark returns nothing at present for those two print lines.
> The output for model2 in PySpark should be
> {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the 
> convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 
> 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 
> penalty.'): 0.0,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', 
> doc='prediction column name.'): 'prediction',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', 
> doc='features column name.'): 'features',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', 
> doc='label column name.'): 'label',
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='probabilityCol', doc='Column name for predicted class conditional 
> probabilities. Note: Not all models output well-calibrated probability 
> estimates! These probabilities should be treated as confidences, not precise 
> probabilities.'): 'myProbability',
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column 
> name.'): 'rawPrediction',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', 
> doc='The name of family which is a description of the label distribution to 
> be used in the model. Supported options: auto, binomial, multinomial'): 
> 'auto',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', 
> doc='whether to fit an intercept term.'): True,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', 
> doc='Threshold in binary classification prediction, in range [0, 1]. If 
> threshold and thresholds are both set, they must match.e.g. if threshold is 
> p, then thresholds must be equal to [1-p, p].'): 0.55,
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', 
> doc='max number of iterations (>= 0).'): 30,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', 
> doc='regularization parameter (>= 0).'): 0.1,
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='standardization', doc='whether to standardize the training features 
> before fitting the model.'): True}






[jira] [Created] (SPARK-21940) Support timezone for timestamps in SparkR

2017-09-06 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-21940:
--

 Summary: Support timezone for timestamps in SparkR
 Key: SPARK-21940
 URL: https://issues.apache.org/jira/browse/SPARK-21940
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


{{SparkR::createDataFrame()}} wipes the timezone attribute from POSIXct and 
POSIXlt values. See the following example:

{code}
> x <- data.frame(x = c(Sys.time()))
> x
x
1 2017-09-06 19:17:16
> attr(x$x, "tzone") <- "Europe/Paris"
> x
x
1 2017-09-07 04:17:16
> collect(createDataFrame(x))
x
1 2017-09-06 19:17:16
{code}






[jira] [Assigned] (SPARK-21933) Spark Streaming requests more executors than expected without DynamicAllocation

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21933:


Assignee: Apache Spark

> Spark Streaming requests more executors than expected without DynamicAllocation
> --
>
> Key: SPARK-21933
> URL: https://issues.apache.org/jira/browse/SPARK-21933
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Congxian Qiu
>Assignee: Apache Spark
>
> When a Spark Streaming application runs on YARN without DynamicAllocation and 
> a NodeManager is lost, the containers on the lost NodeManager are reported to 
> the ApplicationMaster, which allocates new containers to replace them.
> If the lost NodeManager later becomes available again and the ResourceManager 
> is then restarted, the NodeManager re-reports all the containers it previously 
> ran to the ResourceManager (because of YARN's HA/recovery). The 
> ApplicationMaster then receives duplicate completed-container messages, so the 
> Spark Streaming application requests more resources than it requires.






[jira] [Assigned] (SPARK-21933) Spark Streaming requests more executors than expected without DynamicAllocation

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21933:


Assignee: (was: Apache Spark)

> Spark Streaming requests more executors than expected without DynamicAllocation
> --
>
> Key: SPARK-21933
> URL: https://issues.apache.org/jira/browse/SPARK-21933
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Congxian Qiu
>
> When a Spark Streaming application runs on YARN without DynamicAllocation and 
> a NodeManager is lost, the containers on the lost NodeManager are reported to 
> the ApplicationMaster, which allocates new containers to replace them.
> If the lost NodeManager later becomes available again and the ResourceManager 
> is then restarted, the NodeManager re-reports all the containers it previously 
> ran to the ResourceManager (because of YARN's HA/recovery). The 
> ApplicationMaster then receives duplicate completed-container messages, so the 
> Spark Streaming application requests more resources than it requires.






[jira] [Commented] (SPARK-21933) Spark Streaming requests more executors than expected without DynamicAllocation

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156321#comment-16156321
 ] 

Apache Spark commented on SPARK-21933:
--

User 'klion26' has created a pull request for this issue:
https://github.com/apache/spark/pull/19145

> Spark Streaming requests more executors than expected without DynamicAllocation
> --
>
> Key: SPARK-21933
> URL: https://issues.apache.org/jira/browse/SPARK-21933
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Congxian Qiu
>
> When a Spark Streaming application runs on YARN without DynamicAllocation and 
> a NodeManager is lost, the containers on the lost NodeManager are reported to 
> the ApplicationMaster, which allocates new containers to replace them.
> If the lost NodeManager later becomes available again and the ResourceManager 
> is then restarted, the NodeManager re-reports all the containers it previously 
> ran to the ResourceManager (because of YARN's HA/recovery). The 
> ApplicationMaster then receives duplicate completed-container messages, so the 
> Spark Streaming application requests more resources than it requires.






[jira] [Commented] (SPARK-21937) Spark SQL DDL/DML docs non-existent

2017-09-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156255#comment-16156255
 ] 

Hyukjin Kwon commented on SPARK-21937:
--

(This should be related to SPARK-14764)

> Spark SQL DDL/DML docs non-existent
> ---
>
> Key: SPARK-21937
> URL: https://issues.apache.org/jira/browse/SPARK-21937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>
> As far as I can tell, there aren't any documents stating what we support 
> for Spark SQL commands. It would be nice to have a more formal doc covering 
> DDL/DML.






[jira] [Commented] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156226#comment-16156226
 ] 

Apache Spark commented on SPARK-21835:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19151

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries to join 
> operations. While checking structural integrity, I found that 
> {{RewritePredicateSubquery}} can produce unresolved query plans due to 
> conflicting attributes. We should not let it produce unresolved plans.
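
For context, a hedged sketch of the rewrite this rule performs; the table names 
are hypothetical. A predicate subquery such as IN or EXISTS is turned into a 
left semi join, which is where conflicting attributes can arise:

{code}
// Conceptually, RewritePredicateSubquery turns an IN/EXISTS predicate
// subquery into a left semi join on the correlated columns.
val df = spark.sql("SELECT * FROM t1 WHERE a IN (SELECT a FROM t2)")
df.explain(true)  // the optimized plan shows a LeftSemi join, not a subquery
{code}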






[jira] [Assigned] (SPARK-21939) Use TimeLimits instead of Timeouts

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21939:


Assignee: (was: Apache Spark)

> Use TimeLimits instead of Timeouts
> --
>
> Key: SPARK-21939
> URL: https://issues.apache.org/jira/browse/SPARK-21939
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
> This issue replaces the deprecated one with 
> `org.scalatest.concurrent.TimeLimits`.
> {code}
> -import org.scalatest.concurrent.Timeouts._
> +import org.scalatest.concurrent.TimeLimits._
> {code}






[jira] [Commented] (SPARK-21939) Use TimeLimits instead of Timeouts

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156222#comment-16156222
 ] 

Apache Spark commented on SPARK-21939:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19150

> Use TimeLimits instead of Timeouts
> --
>
> Key: SPARK-21939
> URL: https://issues.apache.org/jira/browse/SPARK-21939
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
> This issue replaces the deprecated one with 
> `org.scalatest.concurrent.TimeLimits`.
> {code}
> -import org.scalatest.concurrent.Timeouts._
> +import org.scalatest.concurrent.TimeLimits._
> {code}






[jira] [Assigned] (SPARK-21939) Use TimeLimits instead of Timeouts

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21939:


Assignee: Apache Spark

> Use TimeLimits instead of Timeouts
> --
>
> Key: SPARK-21939
> URL: https://issues.apache.org/jira/browse/SPARK-21939
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
> This issue replaces the deprecated one with 
> `org.scalatest.concurrent.TimeLimits`.
> {code}
> -import org.scalatest.concurrent.Timeouts._
> +import org.scalatest.concurrent.TimeLimits._
> {code}






[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-09-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters.  
Currently, only primitive data types are supported.

Remaining types:

* *Date*
* *Timestamp*
* *Complex*: Struct, Array, Map
* *Decimal*


Some things to do before closing this out:

* Look into upgrading to Arrow 0.7 for better Decimal support (values can now 
be written as BigDecimal)
* Need to add some user docs
* Make sure Python tests are thorough
* Check into complex type support mentioned in comments by [~leif]; should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters.  
Currently, only primitive data types are supported.

Remaining types:

* *Date*
* *Timestamp*
* *Complex*: Struct, Array, Map
* *Decimal*


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.
> Remaining types:
> * *Date*
> * *Timestamp*
> * *Complex*: Struct, Array, Map
> * *Decimal*
> Some things to do before closing this out:
> * Look into upgrading to Arrow 0.7 for better Decimal support (values can now 
> be written as BigDecimal)
> * Need to add some user docs
> * Make sure Python tests are thorough
> * Check into complex type support mentioned in comments by [~leif]; should we 
> support multi-indexing?






[jira] [Updated] (SPARK-21939) Use TimeLimits instead of Timeouts

2017-09-06 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21939:
--
Summary: Use TimeLimits instead of Timeouts  (was: Use TimeLimits instead 
of TimeLimits)

> Use TimeLimits instead of Timeouts
> --
>
> Key: SPARK-21939
> URL: https://issues.apache.org/jira/browse/SPARK-21939
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
> This issue replaces the deprecated one with 
> `org.scalatest.concurrent.TimeLimits`.
> {code}
> -import org.scalatest.concurrent.Timeouts._
> +import org.scalatest.concurrent.TimeLimits._
> {code}






[jira] [Created] (SPARK-21939) Use TimeLimits instead of TimeLimits

2017-09-06 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21939:
-

 Summary: Use TimeLimits instead of TimeLimits
 Key: SPARK-21939
 URL: https://issues.apache.org/jira/browse/SPARK-21939
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 2.2.0
Reporter: Dongjoon Hyun
Priority: Trivial


Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
This issue replaces the deprecated one with 
`org.scalatest.concurrent.TimeLimits`.
{code}
-import org.scalatest.concurrent.Timeouts._
+import org.scalatest.concurrent.TimeLimits._
{code}






[jira] [Comment Edited] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156181#comment-16156181
 ] 

Abbi McClintic edited comment on SPARK-21938 at 9/6/17 11:20 PM:
-

Unfortunately it appears to be unpredictable and we're not able to reliably 
reproduce the problem. 
It occurs 1-3x/week on a daily job - we're essentially just using 
df.write().csv("s3://some-bucket/some_location")


was (Author: abbim):
Unfortunately it appears to unpredictable and we're not able to reliably 
reproduce the problem. 
It occurs 1-3x/week on a daily job - we're essentially just using 
df.write().csv("s3://some-bucket/some_location")

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug in which only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Commented] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156181#comment-16156181
 ] 

Abbi McClintic commented on SPARK-21938:


Unfortunately it appears to be unpredictable and we're not able to reliably 
reproduce the problem. 
It occurs 1-3x/week on a daily job - we're essentially just using 
df.write().csv("s3://some-bucket/some_location")

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug in which only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Updated] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21893:
-
Component/s: (was: Structured Streaming)
 DStreams

> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Priority: Minor
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.






[jira] [Resolved] (SPARK-21901) Define toString for StateOperatorProgress

2017-09-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21901.
--
   Resolution: Fixed
 Assignee: Jacek Laskowski
Fix Version/s: 2.3.0
   2.2.1

> Define toString for StateOperatorProgress
> -
>
> Key: SPARK-21901
> URL: https://issues.apache.org/jira/browse/SPARK-21901
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.2.1, 2.3.0
>
>
> {{StateOperatorProgress}} should define its own {{toString}}.
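
A minimal sketch of what such an override could look like (the actual patch 
may differ; the field names below are from the 2.2 {{StateOperatorProgress}} 
API):

{code}
// Hypothetical toString built from StateOperatorProgress's public fields.
override def toString: String =
  s"StateOperatorProgress(numRowsTotal=$numRowsTotal, " +
  s"numRowsUpdated=$numRowsUpdated, memoryUsedBytes=$memoryUsedBytes)"
{code}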






[jira] [Commented] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156118#comment-16156118
 ] 

Marco Gaido commented on SPARK-21938:
-

It would be helpful if you could post sample code that reproduces the issue, 
along with some sample data. Thanks.

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug in which only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-06 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156082#comment-16156082
 ] 

Bryan Cutler commented on SPARK-21190:
--

I attached my PR because the work had already been done and pretty much matches 
the proposal here. Defining the UDF is slightly different: I just use a flag in 
the normal UDF declaration, for example {{udf(my_func, DoubleType(), 
vectorized=True)}}, but it is trivial to change it to {{pandas_udf}} if that is 
what's decided. I'm not sure whether there was a problem with using my PR for 
this, but I'm more than happy to discuss the differences and hopefully avoid 
duplicated work.

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
> 

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156068#comment-16156068
 ] 

Apache Spark commented on SPARK-21190:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/18659

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  






[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-06 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156036#comment-16156036
 ] 

Juliusz Sompolski commented on SPARK-21907:
---

[~kiszk] unfortunately I don't have a small-scale repro. I hit it several times 
when running SQL queries operating on several TB of data on a cluster with ~20 
nodes / ~300 cores.
Looking at the code, I'm quite sure it's caused by UnsafeInMemorySorter.reset:
{code}
  consumer.freeArray(array);
  array = consumer.allocateArray(initialSize);
{code}
where allocating this array just after it was freed fails with another OOM, 
causing a nested spill.
I think nested spilling is invalid, and thus acquiring memory in a way that can 
trigger a spill is invalid on code paths that are already spilling.

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> 

[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-06 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156015#comment-16156015
 ] 

Kazuaki Ishizaki commented on SPARK-21907:
--

Thank you for your report. Could you please attach a program that can reproduce 
this issue?

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at 
> 

[jira] [Resolved] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-09-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21835.
-
   Resolution: Later
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.3.0

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. 
> During the structural integrity check, I found {{RewritePredicateSubquery}} can 
> produce unresolved query plans due to conflicting attributes. We should not 
> let {{RewritePredicateSubquery}} produce unresolved plans.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2017-09-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13656:
--
Target Version/s:   (was: 2.0.0)

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-09-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-21765.
---
   Resolution: Fixed
Fix Version/s: (was: 2.3.0)
   3.0.0

Issue resolved by pull request 19056
[https://github.com/apache/spark/pull/19056]

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.
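
For context, the user-facing surface of this bit is {{Dataset.isStreaming}}; a
minimal sketch (assuming the built-in rate source available in Spark 2.2+):

{code:scala}
// A leaf derived from a streaming source reports isStreaming = true.
val streamDf = spark.readStream.format("rate").load()
assert(streamDf.isStreaming)

// A batch plan reports isStreaming = false.
val batchDf = spark.range(10).toDF()
assert(!batchDf.isStreaming)
{code}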



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155812#comment-16155812
 ] 

Apache Spark commented on SPARK-21652:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19149

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true', and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> Before:
> Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
> :  +- Relation[col1#32,col2#33] parquet
> +- Filter ((1 = col#34) && isnotnull(col#34))
>    +- Relation[col#34] parquet
> After:
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
> :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
> :  +- Relation[col1#32,col2#33] parquet
> +- Filter ((1 = col#34) && isnotnull(col#34))
>    +- Relation[col#34] parquet
>
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> Before:
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
> :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
> :  +- Relation[col1#32,col2#33] parquet
> +- Filter ((1 = col#34) && isnotnull(col#34))
>    +- Relation[col#34] parquet
> After:
> Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> :- Filter (col2#33 = col1#32)
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
> :     +- Relation[col1#32,col2#33] parquet
> +- Filter ((1 = col#34) && isnotnull(col#34))
>    +- Relation[col#34] parquet
>
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
> Before:
> Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> :- Filter (col2#33 = col1#32)
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
> :     +- Relation[col1#32,col2#33] parquet
> +- Filter ((1 = col#34) && isnotnull(col#34))
>    +- Relation[col#34] parquet
> After:
> Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> :  +- Relation[col1#32,col2#33] parquet
> +- Filter ((1 = col#34) && isnotnull(col#34))
>    +- Relation[col#34] parquet
>
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation ===
> Before:
> Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))

[jira] [Commented] (SPARK-14155) Hide UserDefinedType in Spark 2.0

2017-09-06 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155811#comment-16155811
 ] 

Maciej Szymkiewicz commented on SPARK-14155:


[~barrybecker4] Nope: SPARK-7768

> Hide UserDefinedType in Spark 2.0
> -
>
> Key: SPARK-14155
> URL: https://issues.apache.org/jira/browse/SPARK-14155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> UserDefinedType is a developer API in Spark 1.x.
> With very high probability we will create a new API for user-defined type 
> that also works well with column batches as well as encoders (datasets). In 
> Spark 2.0, let's make UserDefinedType private[spark] first.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21931) add LNNVL function

2017-09-06 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155797#comment-16155797
 ] 

Ruslan Dautkhanov commented on SPARK-21931:
---

Example 1)

{code:sql}
select * from products where LNNVL(qty >= reorder_level)
{code}

without LNNVL:

{code:sql}
select * from products where NVL(qty, -1) >= NVL(reorder_level, 0)
{code}

Example 2)
{code:sql}
SELECT empno, comm FROM emp WHERE LNNVL ( comm > 0 )
{code}
without LNNVL:
{code:sql}
SELECT empno, comm FROM emp WHERE NOT (comm > 0) OR comm IS NULL
{code}

Is LNNVL essential? Nope. Is it helpful? Sometimes a lot. 

Oracle has had LNNVL for ages, although they documented it relatively recently, 
in Oracle 11g. Hope it makes it into ANSI SQL someday.

Some other helpful NULL-related Oracle functions: 
https://oracle-base.com/articles/misc/null-related-functions 
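
For anyone wanting this behavior in Spark today, a minimal sketch of a
workaround (an assumption on my part, not an existing Spark API): under SQL
three-valued logic, LNNVL(c) has the same truth table as coalesce(NOT c, TRUE).

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, lit, not}

// LNNVL(c) is TRUE when c is FALSE or UNKNOWN, and FALSE when c is TRUE;
// coalesce(not(c), lit(true)) evaluates exactly the same way.
def lnnvl(c: Column): Column = coalesce(not(c), lit(true))

// Example 1 above, without a built-in LNNVL (column names assumed):
// products.where(lnnvl(col("qty") >= col("reorder_level")))
{code}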

> add LNNVL function
> --
>
> Key: SPARK-21931
> URL: https://issues.apache.org/jira/browse/SPARK-21931
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Minor
> Attachments: Capture1.JPG
>
>
> Purpose
> LNNVL provides a concise way to evaluate a condition when one or both 
> operands of the condition may be null. The function can be used only in the 
> WHERE clause of a query. It takes as an argument a condition and returns TRUE 
> if the condition is FALSE or UNKNOWN and FALSE if the condition is TRUE. 
> LNNVL can be used anywhere a scalar expression can appear, even in contexts 
> where the IS (NOT) NULL, AND, or OR conditions are not valid but would 
> otherwise be required to account for potential nulls. Oracle Database 
> sometimes uses the LNNVL function internally in this way to rewrite NOT IN 
> conditions as NOT EXISTS conditions. In such cases, output from EXPLAIN PLAN 
> shows this operation in the plan table output. The condition can evaluate any 
> scalar values but cannot be a compound condition containing AND, OR, or 
> BETWEEN.
> The table that follows shows what LNNVL returns given that a = 2 and b is 
> null.
> !Capture1.JPG!
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions078.htm 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abbi McClintic updated SPARK-21938:
---
Description: 
Hello,
My team has been experiencing a recurring unpredictable bug where only a 
partial write to CSV in S3 on one partition of our Dataset is performed. For 
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of 
the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the job 
does not exit with an error code. 

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the partial 
file, suggesting that the failure occurred at a weird byte in the file. 
2. We lose data - sometimes as much as 10%. 

We don't see this problem with parquet, which we also use, but moving all of 
our data to parquet is not currently feasible. We're using the Java API.

Any help on resolving this would be much appreciated.

  was:
Hello,
My team has been experiencing a recurring unpredictable bug where only a 
partial write to CSV in S3 on one partition of our Dataset is performed. For 
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of 
the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the job 
does not fail. 

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the partial 
file, suggesting that the failure occurred at a weird byte in the file. 
2. We lose data - sometimes as much as 10%. 

We don't see this problem with parquet, which we also use, but moving all of 
our data to parquet is not currently feasible. We're using the Java API.

Any help on resolving this would be much appreciated.


> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring unpredictable bug where only a 
> partial write to CSV in S3 on one partition of our Dataset is performed. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14155) Hide UserDefinedType in Spark 2.0

2017-09-06 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155790#comment-16155790
 ] 

Barry Becker commented on SPARK-14155:
--

Does it work with datasets now in 2.1?

> Hide UserDefinedType in Spark 2.0
> -
>
> Key: SPARK-14155
> URL: https://issues.apache.org/jira/browse/SPARK-14155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> UserDefinedType is a developer API in Spark 1.x.
> With very high probability we will create a new API for user-defined type 
> that also works well with column batches as well as encoders (datasets). In 
> Spark 2.0, let's make UserDefinedType private[spark] first.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abbi McClintic updated SPARK-21938:
---
Description: 
Hello,
My team has been experiencing a recurring unpredictable bug where only a 
partial write to CSV in S3 on one partition of our Dataset is performed. For 
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of 
the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the job 
does not fail. 

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the partial 
file, suggesting that the failure occurred at a weird byte in the file. 
2. We lose data - sometimes as much as 10%. 

We don't see this problem with parquet, which we also use, but moving all of 
our data to parquet is not currently feasible. We're using the Java API.

Any help on resolving this would be much appreciated.

  was:
Hello,
My team has been experiencing a recurring unpredictable bug where only a 
partial write to CSV in S3 on one partition of our Dataset is performed. For 
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of 
the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the job 
does not fail. 

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the partial 
file, suggesting that the failure occurred at a weird byte in the file. 
2. We lose data - sometimes as much as 10%. 

We don't see this problem with parquet, which we also use, but moving all of 
our data to parquet is not currently feasible. 

Any help on resolving this would be much appreciated.


> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring unpredictable bug where only a 
> partial write to CSV in S3 on one partition of our Dataset is performed. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not fail. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Abbi McClintic (JIRA)
Abbi McClintic created SPARK-21938:
--

 Summary: Spark partial CSV write fails silently
 Key: SPARK-21938
 URL: https://issues.apache.org/jira/browse/SPARK-21938
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 2.2.0
 Environment: Amazon EMR 5.8, varying instance types
Reporter: Abbi McClintic


Hello,
My team has been experiencing a recurring unpredictable bug where only a 
partial write to CSV in S3 on one partition of our Dataset is performed. For 
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of 
the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the job 
does not fail. 

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the partial 
file, suggesting that the failure occurred at a weird byte in the file. 
2. We lose data - sometimes as much as 10%. 

We don't see this problem with parquet, which we also use, but moving all of 
our data to parquet is not currently feasible. 

Any help on resolving this would be much appreciated.
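
Until the root cause is found, one defensive check is to re-read the written
output and compare row counts; a minimal sketch in Scala (the table name and
output path are illustrative assumptions):

{code:scala}
val df = spark.table("source_table")            // hypothetical source
val expected = df.count()

df.write.csv("s3://bucket/out")                 // hypothetical output path

val actual = spark.read.csv("s3://bucket/out").count()
require(actual == expected,
  s"partial write detected: $actual of $expected rows")
{code}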



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21937) Spark SQL DDL/DML docs non-existent

2017-09-06 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-21937:
--
Issue Type: Improvement  (was: Bug)

> Spark SQL DDL/DML docs non-existent
> ---
>
> Key: SPARK-21937
> URL: https://issues.apache.org/jira/browse/SPARK-21937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>
> As far as I can tell, there aren't any documents stating what we support for 
> Spark SQL commands. It would be nice to have a more formal doc covering 
> DDL/DML.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21937) Spark SQL DDL/DML docs non-existent

2017-09-06 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-21937:
-

 Summary: Spark SQL DDL/DML docs non-existent
 Key: SPARK-21937
 URL: https://issues.apache.org/jira/browse/SPARK-21937
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Thomas Graves


As far as I can tell, there aren't any documents stating what we support for 
Spark SQL commands. It would be nice to have a more formal doc covering DDL/DML.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException

2017-09-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155727#comment-16155727
 ] 

Sean Owen commented on SPARK-17901:
---

Possibly, but I would only reopen if you can reproduce on master.

> NettyRpcEndpointRef: Error sending message and Caused by: 
> java.util.ConcurrentModificationException
> ---
>
> Key: SPARK-17901
> URL: https://issues.apache.org/jira/browse/SPARK-17901
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>
> I have two data frames, one with 10K rows and 10,000 columns and another with 
> 4M rows and 50 columns. I joined them and tried to compute the mean of the 
> merged data set.
> I calculated the mean using a lambda with the Python mean() function; I can't 
> write it in PySpark due to the 64KB code limit issue.
> After calculating the mean I did rdd.take(2), which works. But creating the DF 
> from the RDD and DF.show had been running for more than 2 hours (I stopped the 
> process) with the below message (102 GB, 6 cores per node, 10 nodes total plus 
> 16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, 
> 35729))] in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
>   at java.util.ArrayList.writeObject(ArrayList.java:766)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at 

[jira] [Comment Edited] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException

2017-09-06 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155707#comment-16155707
 ] 

Sergei Lebedev edited comment on SPARK-17901 at 9/6/17 5:09 PM:


[~srowen], I think this issue could've been closed by mistake: the stack trace 
is different from SPARK-17816. Could you reopen?


was (Author: lebedev):
[~srowen] I think this issue could've been closed by mistake: the stack trace 
is different from SPARK-17816. Could you reopen?

> NettyRpcEndpointRef: Error sending message and Caused by: 
> java.util.ConcurrentModificationException
> ---
>
> Key: SPARK-17901
> URL: https://issues.apache.org/jira/browse/SPARK-17901
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>
> I have two data frames, one with 10K rows and 10,000 columns and another with 
> 4M rows and 50 columns. I joined them and tried to compute the mean of the 
> merged data set.
> I calculated the mean using a lambda with the Python mean() function; I can't 
> write it in PySpark due to the 64KB code limit issue.
> After calculating the mean I did rdd.take(2), which works. But creating the DF 
> from the RDD and DF.show had been running for more than 2 hours (I stopped the 
> process) with the below message (102 GB, 6 cores per node, 10 nodes total plus 
> 16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, 
> 35729))] in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
>   at java.util.ArrayList.writeObject(ArrayList.java:766)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> 

[jira] [Comment Edited] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException

2017-09-06 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155707#comment-16155707
 ] 

Sergei Lebedev edited comment on SPARK-17901 at 9/6/17 5:09 PM:


[~srowen] I think this issue could've been closed by mistake: the stack trace 
is different from SPARK-17816. Could you reopen?


was (Author: lebedev):
I think this issue could've been closed by mistake: the stack trace is 
different from SPARK-17816. Could you reopen?

> NettyRpcEndpointRef: Error sending message and Caused by: 
> java.util.ConcurrentModificationException
> ---
>
> Key: SPARK-17901
> URL: https://issues.apache.org/jira/browse/SPARK-17901
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>
> I have two data frames, one with 10K rows and 10,000 columns and another with 
> 4M rows and 50 columns. I joined them and tried to compute the mean of the 
> merged data set.
> I calculated the mean using a lambda with the Python mean() function; I can't 
> write it in PySpark due to the 64KB code limit issue.
> After calculating the mean I did rdd.take(2), which works. But creating the DF 
> from the RDD and DF.show had been running for more than 2 hours (I stopped the 
> process) with the below message (102 GB, 6 cores per node, 10 nodes total plus 
> 16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, 
> 35729))] in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
>   at java.util.ArrayList.writeObject(ArrayList.java:766)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> 

[jira] [Updated] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-09-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21926:
-
Summary: Some transformers in spark.ml.feature fail when trying to 
transform streaming dataframes  (was: Some transformers in spark.ml.feature 
fail when trying to transform steaming dataframes)

> Some transformers in spark.ml.feature fail when trying to transform streaming 
> dataframes
> 
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata (see the sketch after this list).
> Possible fixes:
> a) Re-design vectorUDT metadata to support missing metadata for some 
> elements. (This might be a good thing to do anyways SPARK-19141)
> b) drop metadata in streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.
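
A minimal Scala sketch of failing case 1, with an assumed schema and source
path (illustrative only):

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types._

// Streaming input whose vector column carries no ML metadata.
val schema = new StructType().add("features", VectorType).add("x", DoubleType)
val streamDf = spark.readStream.schema(schema).parquet("/tmp/vectors")

val assembler = new VectorAssembler()
  .setInputCols(Array("features", "x"))
  .setOutputCol("assembled")
val out = assembler.transform(streamDf)  // reported to fail without vector size metadata
{code}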



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException

2017-09-06 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155707#comment-16155707
 ] 

Sergei Lebedev commented on SPARK-17901:


I think this issue could've been closed by mistake: the stack trace is 
different from SPARK-17816. Could you reopen?

> NettyRpcEndpointRef: Error sending message and Caused by: 
> java.util.ConcurrentModificationException
> ---
>
> Key: SPARK-17901
> URL: https://issues.apache.org/jira/browse/SPARK-17901
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>
> I have two data frames, one with 10K rows and 10,000 columns and another with 
> 4M rows and 50 columns. I joined them and tried to compute the mean of the 
> merged data set.
> I calculated the mean using a lambda with the Python mean() function; I can't 
> write it in PySpark due to the 64KB code limit issue.
> After calculating the mean I did rdd.take(2), which works. But creating the DF 
> from the RDD and DF.show had been running for more than 2 hours (I stopped the 
> process) with the below message (102 GB, 6 cores per node, 10 nodes total plus 
> 16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, 
> 35729))] in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
>   at java.util.ArrayList.writeObject(ArrayList.java:766)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> 

[jira] [Commented] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id

2017-09-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155672#comment-16155672
 ] 

Dongjoon Hyun commented on SPARK-21858:
---

Given that the reporter of that Hive issue is a Spark committer, I guess this 
Spark feature was intentionally designed this way.

> Make Spark grouping_id() compatible with Hive grouping__id
> --
>
> Key: SPARK-21858
> URL: https://issues.apache.org/jira/browse/SPARK-21858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yann Byron
>
> If you want to migrate ETLs that use `grouping__id` in Hive to Spark, using 
> Spark's `grouping_id()` in place of Hive's `grouping__id`, you will find 
> differences between their evaluations.
> Here is an example.
> {code:sql}
> select A, B, grouping__id/grouping_id() from t group by A, B grouping 
> sets((), (A), (B), (A,B))
> {code}
> Running it on Hive and Spark separately, you'll find the following (an 
> attribute selected in the grouping set is represented by (/), and otherwise 
> by (x)):
> ||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B A||
> |(x) (x)|11|3|0|00|(x) (x)|
> |(x) (/)|10|2|2|10|(/) (x)|
> |(/) (x)|01|1|1|01|(x) (/)|
> |(/) (/)|00|0|3|11|(/) (/)|
> As shown above, in Hive (/) is set to 0 and (x) to 1, while in Spark it is 
> the opposite. Moreover, in Hive the attributes in `group by` are reversed 
> first, whereas in Spark they are evaluated in the given order.
> I suggest modifying the behavior of `grouping_id()` to make it compatible 
> with Hive's `grouping__id`.
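
A hedged sketch of the conversion implied by the table above, assuming the
bit conventions shown there (Spark marks an excluded attribute with 1, Hive
with 0, and Hive reads the attributes in reverse order):

{code:scala}
// Convert Spark's grouping_id() to Hive's grouping__id for a GROUP BY over
// n columns: complement the n low bits, then reverse their order.
def sparkToHiveGroupingId(sparkGid: Long, n: Int): Long = {
  val complemented = ~sparkGid & ((1L << n) - 1)
  (0 until n).foldLeft(0L) { (acc, i) =>
    (acc << 1) | ((complemented >> i) & 1L)
  }
}

// For the 2-column table above: 3 -> 0, 2 -> 2, 1 -> 1, 0 -> 3.
{code}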



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21936:


Assignee: Apache Spark  (was: Wenchen Fan)

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21936:


Assignee: Wenchen Fan  (was: Apache Spark)

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155667#comment-16155667
 ] 

Apache Spark commented on SPARK-21936:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19148

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-06 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21936:
---

 Summary: backward compatibility test framework for 
HiveExternalCatalog
 Key: SPARK-21936
 URL: https://issues.apache.org/jira/browse/SPARK-21936
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaos Tsipas updated SPARK-21935:

Comment: was deleted

(was: Hmm.. I think I understand it a bit better now after some reading about 
how pyspark works. 

The problem is that the amount of memory used by Python on the executor is 
separate from the amount of memory used by the JVM. However, all the 
calculations in spark-defaults.conf are based on the assumption that only the 
JVM will be consuming memory in the executor/YARN container. 

Would the right approach then be to minimise the amount of memory used by the 
JVM in the container and leave the rest of the container memory available to 
Python?)
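
A hedged sketch of the tuning direction the deleted comment describes: shrink
the executor JVM heap and raise the off-heap overhead so the YARN container
leaves headroom for the Python worker processes. The values below are
illustrative assumptions, not recommendations:

{code}
# spark-defaults.conf sketch (Spark 2.1-era YARN property names)
spark.executor.memory                 6G
spark.yarn.executor.memoryOverhead    4096
{code}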

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried increasing {{spark.yarn.executor.memoryOverhead}} to 3000, which 
> delays the errors, but I still hit them before the end of the job, and the 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> CPU utilisation is very bad with pyspark.
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155422#comment-16155422
 ] 

Yanbo Liang commented on SPARK-21866:
-

[~timhunter] Fair enough.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the 

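For concreteness, here is a minimal PySpark sketch of the schema as described up to 
the point where the message is cut off. Only the {{mode}} and {{origin}} fields are 
named in the surviving text; the remaining fields are illustrative assumptions, not 
part of the quoted proposal.

{code}
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

# Sketch of the proposed image schema. "mode" and "origin" come from the text
# above; height/width/data are assumed fields added only for illustration.
imageSchema = StructType([
    StructField("mode", StringType(), False),    # OpenCV-style type, e.g. "CV_8UC3"; "" if the load failed
    StructField("origin", StringType(), True),   # where the image came from (e.g. a file path)
    StructField("height", IntegerType(), False), # assumed: height in pixels
    StructField("width", IntegerType(), False),  # assumed: width in pixels
    StructField("data", BinaryType(), False),    # assumed: decompressed pixel bytes
])
{code}
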
[jira] [Commented] (SPARK-21920) DataFrame Fail To Find The Column Name

2017-09-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155397#comment-16155397
 ] 

Sean Owen commented on SPARK-21920:
---

This is not where you ask questions. As I say, use the mailing list. I've 
already given what I think is the essential explanation.

> DataFrame Fail To Find The Column Name
> --
>
> Key: SPARK-21920
> URL: https://issues.apache.org/jira/browse/SPARK-21920
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: abhijit nag
>Priority: Minor
>
> I am getting one issue like "sql.AnalysisException: cannot resolve 
> column_name"
> I wrote a simple query as below:
> DataFrame df = df1
>   .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
>   .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
> Exception found: 
> resolved attribute(s) MERCH_ID#738 missing from 
> MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project 
> [MERCH_ID#738,MERCHANT#737];
> The problem was solved by the following code:
> DataFrame df = df1.alias("df1")
>   .join(df2.alias("df2"), 
>         functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
>   .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
> This kind of issue appears rarely, but I want to know its root cause.
> Is it a bug in Spark 1.6 or something else?
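
Not from the report: a minimal PySpark sketch of the same class of failure, assuming 
{{df2}} is derived from {{df1}} so that both sides of the join share attribute IDs, 
together with the alias workaround the reporter describes. The alias names are 
hypothetical.

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "m1")], ["MERCH_ID", "MERCHANT"])
df2 = df1.select("MERCH_ID", "MERCHANT")  # derived from df1: same lineage, same attribute IDs

# Referencing columns through both the parent and the derived DataFrame can
# trip the analyzer ("resolved attribute(s) ... missing"). Aliasing each side,
# as in the workaround above, disambiguates the references:
out = (df1.alias("a")
          .join(df2.alias("b"), F.col("a.MERCHANT") == F.col("b.MERCHANT"))
          .select(F.col("b.MERCH_ID"), F.col("b.MERCHANT")))
{code}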



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-09-06 Thread Anthony Dotterer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155387#comment-16155387
 ] 

Anthony Dotterer edited comment on SPARK-20958 at 9/6/17 2:01 PM:
--

As a user of Spark 2.2.0 that mixes usage of parquet-avro and avro, here are the 
exceptions that I had.  This will hopefully make search engines find this 
library conflict more quickly for others.

{code}
java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
at 
org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
at 
org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
at 
org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
...
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
{code}



was (Author: spiricalsalsaz):
As a user of Spark 2.2.0 that mixes usage of parquet-avro and avro, here are 
some exceptions that I had.  This will hopefully make search engines find this 
library conflict more quickly for others.

{code}
java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
at 
org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
at 
org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
at 
org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
...
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
{code}


> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21920) DataFrame Fail To Find The Column Name

2017-09-06 Thread abhijit nag (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155384#comment-16155384
 ] 

abhijit nag edited comment on SPARK-21920 at 9/6/17 2:04 PM:
-

[~srowen] Please read the entire post. The problem statement has been given with 
all the details. You can also check the external link. The DataFrame API is 
failing in a particular case, which needs to be resolved. We found an alternative 
workaround to address the problem, which is also posted. I want to know the 
reason behind this issue.


was (Author: abhijitnag):
[~srowen] Please read the entire post. The problem statement has been given with 
all the details. 

> DataFrame Fail To Find The Column Name
> --
>
> Key: SPARK-21920
> URL: https://issues.apache.org/jira/browse/SPARK-21920
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: abhijit nag
>Priority: Minor
>
> I am getting one issue like "sql.AnalysisException: cannot resolve 
> column_name"
> I wrote a simple query as below:
> DataFrame df = df1
>   .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
>   .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
> Exception found: 
> resolved attribute(s) MERCH_ID#738 missing from 
> MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project 
> [MERCH_ID#738,MERCHANT#737];
> The problem was solved by the following code:
> DataFrame df = df1.alias("df1")
>   .join(df2.alias("df2"), 
>         functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
>   .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
> This kind of issue appears rarely, but I want to know its root cause.
> Is it a bug in Spark 1.6 or something else?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-09-06 Thread Anthony Dotterer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155387#comment-16155387
 ] 

Anthony Dotterer commented on SPARK-20958:
--

As a user of Spark 2.2.0 that mixes usage of parquet-avro and avro, here are 
some exceptions that I had.  This will hopefully make search engines find this 
library conflict more quickly for others.

{code}
java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
at 
org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
at 
org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
at 
org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
...
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
{code}
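
Not from the comment, but possibly useful to those search-engine arrivals: 
{{org.apache.avro.LogicalType}} only exists in Avro 1.8+, while parquet-avro 1.8.2 
expects it. One workaround in the spirit of this ticket's rollback is to build 
against parquet-avro 1.8.1, which works with the Avro 1.7.x that spark-core already 
ships. A hedged Maven sketch, assuming your application controls this dependency:

{code}
<!-- Sketch only: keep parquet-avro at 1.8.1 so it stays compatible with the
     Avro 1.7.x on Spark's classpath, instead of 1.8.2 which needs Avro 1.8+. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.8.1</version>
</dependency>
{code}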


> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21920) DataFrame Fail To Find The Column Name

2017-09-06 Thread abhijit nag (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155384#comment-16155384
 ] 

abhijit nag commented on SPARK-21920:
-

[~srowen] Please read the entire post. The problem statement has been given with 
all the details. 

> DataFrame Fail To Find The Column Name
> --
>
> Key: SPARK-21920
> URL: https://issues.apache.org/jira/browse/SPARK-21920
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: abhijit nag
>Priority: Minor
>
> I am getting one issue like "sql.AnalysisException: cannot resolve 
> column_name"
> I wrote a simple query as below:
> DataFrame df = df1
>   .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
>   .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
> Exception found: 
> resolved attribute(s) MERCH_ID#738 missing from 
> MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project 
> [MERCH_ID#738,MERCHANT#737];
> The problem was solved by the following code:
> DataFrame df = df1.alias("df1")
>   .join(df2.alias("df2"), 
>         functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
>   .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
> This kind of issue appears rarely, but I want to know its root cause.
> Is it a bug in Spark 1.6 or something else?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21920) DataFrame Fail To Find The Column Name

2017-09-06 Thread abhijit nag (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abhijit nag updated SPARK-21920:

Issue Type: Bug  (was: Question)

> DataFrame Fail To Find The Column Name
> --
>
> Key: SPARK-21920
> URL: https://issues.apache.org/jira/browse/SPARK-21920
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: abhijit nag
>Priority: Minor
>
> I am getting one issue like "sql.AnalysisException: cannot resolve 
> column_name"
> I wrote a simple query as below:
> DataFrame df = df1
>   .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
>   .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
> Exception found: 
> resolved attribute(s) MERCH_ID#738 missing from 
> MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project 
> [MERCH_ID#738,MERCHANT#737];
> The problem was solved by the following code:
> DataFrame df = df1.alias("df1")
>   .join(df2.alias("df2"), 
>         functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
>   .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
> This kind of issue appears rarely, but I want to know its root cause.
> Is it a bug in Spark 1.6 or something else?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155378#comment-16155378
 ] 

Nikolaos Tsipas commented on SPARK-21935:
-

Hmm, I think I understand it a bit better now after some reading about how 
pyspark works. 

The problem is that the memory used by Python on the executor is separate from 
the memory used by the JVM. However, all the calculations in spark-defaults.conf 
are based on the assumption that only the JVM will be consuming memory in the 
executor's YARN container. 

Would the right approach then be to minimise the memory used by the JVM in the 
container and leave the rest of the container's memory available to Python?
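
A hedged sketch of that rebalancing for {{spark-defaults.conf}}; the numbers are 
illustrative assumptions for a ~10.4 GB YARN container, not tested values:

{code}
# Illustrative only: shrink the JVM heap to leave room for the Python workers...
spark.executor.memory                 6G
# ...and grow the off-heap allowance that YARN accounts against (in MiB),
# since the pyspark worker processes live outside the JVM heap.
spark.yarn.executor.memoryOverhead    4096
{code}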

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2017-09-06 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155305#comment-16155305
 ] 

Quincy HSIEH commented on SPARK-9776:
-

Hi Holger,

It's not a fix in the distribution. It's a workaround that you have to set up 
through configuration...
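
For readers arriving from a search, a hedged sketch of the kind of configuration 
workaround being referred to. The property is the standard Hive metastore JDBC URL; 
the database path is illustrative. Giving each application its own Derby directory 
avoids two instances contending for the same lock.

{code}
<!-- hive-site.xml sketch: point this application at its own Derby database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/tmp/metastore_db_app1;create=true</value>
</property>
{code}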

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21190:


Assignee: Apache Spark  (was: Reynold Xin)

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
> input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21190:


Assignee: Reynold Xin  (was: Apache Spark)

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
> input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155271#comment-16155271
 ] 

Apache Spark commented on SPARK-21190:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19147

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
> input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-06 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155270#comment-16155270
 ] 

Takuya Ueshin commented on SPARK-21190:
---

[~leif], [~bryanc] Thanks for the pointers about the {{inspect}} module.
I believe we have a consensus on the basic APIs except for the size hint, so 
I've decided to submit a PR based on my proposal first; let's discuss the size 
hint for the 0-parameter UDF (or for more-parameter UDFs, for consistency) 
through the review process.
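
To make that concrete, a minimal sketch in the spirit of the SPIP's "use case 2". 
The {{@spark_udf}} decorator and its schema argument are the proposal's 
placeholders, not a shipped API; the pandas part below runs on its own and shows 
the batch contract (one Series in, one equally long Series out).

{code}
import pandas as pd

# Assumed decorator from the SPIP sketch; a real implementation would record
# the declared return type for the analyzer.
# @spark_udf("double")
def plus(a, b):
    # Receives whole columns as pandas.Series and returns a Series of the same
    # length: one (de)serialization per batch instead of per row.
    return a + b

out = plus(pd.Series([1.0, 2.0]), pd.Series([0.5, 0.5]))
assert len(out) == 2  # row-count contract from the proposal
{code}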

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
> input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To 

[jira] [Assigned] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19357:
--

Assignee: Bryan Cutler

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.
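
Not in the ticket text: a sketch of how such a knob might look from the Python side. 
The setter name {{setParallelism}} is an assumption modeled on the tuning API that 
this work targets for 2.3.

{code}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator())
# 1 keeps today's serial evaluation; larger values trade concurrency against
# the caching concerns listed above.
cv = cv.setParallelism(4)
{code}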



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19357.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 16774
[https://github.com/apache/spark/pull/16774]

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
> Fix For: 2.3.0
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21931) add LNNVL function

2017-09-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155238#comment-16155238
 ] 

Hyukjin Kwon commented on SPARK-21931:
--

I think we can use a null-safe equality comparison to cover this case. I am not 
sure we need this either.
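
A sketch of the null-safe comparison being suggested, with illustrative data: 
{{NOT (b <=> 2)}} is true exactly when "b = 2" is FALSE or UNKNOWN, which is what 
{{LNNVL(b = 2)}} returns.

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, 2), (2, None)], ["a", "b"])

# "b = 2" is UNKNOWN for the NULL row, so the null-safe form keeps it,
# matching LNNVL(b = 2) = TRUE.
df.filter(F.expr("NOT (b <=> 2)")).show()
{code}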

> add LNNVL function
> --
>
> Key: SPARK-21931
> URL: https://issues.apache.org/jira/browse/SPARK-21931
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Minor
> Attachments: Capture1.JPG
>
>
> Purpose
> LNNVL provides a concise way to evaluate a condition when one or both 
> operands of the condition may be null. The function can be used only in the 
> WHERE clause of a query. It takes as an argument a condition and returns TRUE 
> if the condition is FALSE or UNKNOWN and FALSE if the condition is TRUE. 
> LNNVL can be used anywhere a scalar expression can appear, even in contexts 
> where the IS (NOT) NULL, AND, or OR conditions are not valid but would 
> otherwise be required to account for potential nulls. Oracle Database 
> sometimes uses the LNNVL function internally in this way to rewrite NOT IN 
> conditions as NOT EXISTS conditions. In such cases, output from EXPLAIN PLAN 
> shows this operation in the plan table output. The condition can evaluate any 
> scalar values but cannot be a compound condition containing AND, OR, or 
> BETWEEN.
> The table that follows shows what LNNVL returns given that a = 2 and b is 
> null.
> !Capture1.JPG!
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions078.htm 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21927) scalastyle 1.0.0 generates SBT warnings

2017-09-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155194#comment-16155194
 ] 

Hyukjin Kwon commented on SPARK-21927:
--

I can reproduce this with 0.9.0 too, but not with 0.8.0. I am reproducing it 
with the commands below:

{code}
rm -r ~/.ivy2/cache/org.scalastyle/scalastyle_2.10/*
./build/sbt clean package
{code}

{code}
[info] Loading project definition from /.../spark/project
[info] Updating {file:/.../spark/project/}spark-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading 
https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/0.9.0/scalastyle_2.10-0.9.0.jar
 ...
[info]  [SUCCESSFUL ] org.scalastyle#scalastyle_2.10;0.9.0!scalastyle_2.10.jar 
(10870ms)
[info] Done updating.
[warn] Found version conflict(s) in library dependencies; some are suspected to 
be binary incompatible:
[warn]
[warn]  * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
1.0-beta-6
[warn]  +- org.apache.maven:maven-compat:3.0.4(depends on 
2.2)
[warn]  +- org.apache.maven.wagon:wagon-file:2.2  (depends on 
2.2)
[warn]  +- org.spark-project:sbt-pom-reader:1.0.0-spark (scalaVersion=2.10, 
sbtVersion=0.13) (depends on 2.2)
[warn]  +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends on 
2.2)
[warn]  +- org.apache.maven.wagon:wagon-http:2.2  (depends on 
2.2)
[warn]  +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends on 
2.2)
[warn]  +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends on 
1.0-beta-6)
[warn]
[warn]  * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 2.0.6, 
2.1, 1.5.5}
[warn]  +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends on 
3.0)
[warn]  +- org.apache.maven:maven-compat:3.0.4(depends on 
2.0.6)
[warn]  +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends on 
2.0.6)
[warn]  +- org.apache.maven:maven-artifact:3.0.4  (depends on 
2.0.6)
[warn]  +- org.apache.maven:maven-core:3.0.4  (depends on 
2.0.6)
[warn]  +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends on 
2.0.6)
[warn]  +- org.apache.maven:maven-embedder:3.0.4  (depends on 
2.0.6)
[warn]  +- org.apache.maven:maven-settings:3.0.4  (depends on 
2.0.6)
[warn]  +- org.apache.maven:maven-settings-builder:3.0.4  (depends on 
2.0.6)
[warn]  +- org.apache.maven:maven-model-builder:3.0.4 (depends on 
2.0.7)
[warn]  +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends on 
2.0.7)
[warn]  +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends on 
2.0.7)
[warn]  +- org.apache.maven:maven-model:3.0.4 (depends on 
2.0.7)
[warn]  +- org.apache.maven:maven-aether-provider:3.0.4   (depends on 
2.0.7)
[warn]  +- org.apache.maven:maven-repository-metadata:3.0.4   (depends on 
2.0.7)
[warn]
[warn]  * cglib:cglib is evicted completely
[warn]  +- org.sonatype.sisu:sisu-guice:3.0.3 (depends on 
2.2.2)
[warn]
[warn]  * asm:asm is evicted completely
[warn]  +- cglib:cglib:2.2.2  (depends on 
3.3.1)
[warn]
[warn] Run 'evicted' to see detailed eviction warnings
{code}

It looks like there were dependency updates between 0.8.0 and 0.9.0, judging 
from GitHub.

> scalastyle 1.0.0 generates SBT warnings
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>Priority: Minor
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. Looks like the dependency 
> management in the POMs are somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> 

[jira] [Updated] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaos Tsipas updated SPARK-21935:

Description: 
Hi,

I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances 4
spark.executor.cores 8
spark.driver.memory  8G
spark.executor.memory9658M
spark.default.parallelism64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 15 
GiB memory, 160 SSD GB storage

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
delays the errors but eventually I get them before the end of the job. The job 
eventually fails.

!Screen Shot 2017-09-06 at 11.31.31.png|width=800!

If I run the above job in scala everything works as expected (without having to 
adjust the memoryOverhead)

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}

!Screen Shot 2017-09-06 at 11.31.13.png|width=800!

Cpu utilisation is very bad with pyspark

!cpu.png|width=800!

Is this a known bug with pyspark and udfs or is it a matter of bad 
configuration? 
Looking forward to suggestions. Thanks!



  was:
Hi,

I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances 4
spark.executor.cores 8
spark.driver.memory  8G
spark.executor.memory9658M
spark.default.parallelism64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 15 
GiB memory, 160 SSD GB storage

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
delays the errors but eventually I get them before the end of the job. The job 
eventually fails.

!Screen Shot 2017-09-06 at 11.31.31.png|width=800!

If I run the above job in scala everything works as expected (without having to 
adjust the memoryOverhead)

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}

!Screen Shot 2017-09-06 at 11.31.13.png|width=800!

Cpu utilisation is very bad with pyspark

!Screen Shot 2017-09-06 at 11.30.28.png|width=800!

Is this a known bug with pyspark and udfs or is it a matter of bad 
configuration? 
Looking forward to suggestions. Thanks!




> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> 

[jira] [Updated] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaos Tsipas updated SPARK-21935:

Attachment: cpu.png

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !Screen Shot 2017-09-06 at 11.30.28.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaos Tsipas updated SPARK-21935:

Description: 
Hi,

I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances 4
spark.executor.cores 8
spark.driver.memory  8G
spark.executor.memory9658M
spark.default.parallelism64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 15 
GiB memory, 160 SSD GB storage

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
delays the errors but eventually I get them before the end of the job. The job 
eventually fails.

!Screen Shot 2017-09-06 at 11.31.31.png|width=800!

If I run the above job in scala everything works as expected (without having to 
adjust the memoryOverhead)

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}

!Screen Shot 2017-09-06 at 11.31.13.png|width=800!

Cpu utilisation is very bad with pyspark

!Screen Shot 2017-09-06 at 11.30.28.png|width=800!

Is this a known bug with pyspark and udfs or is it a matter of bad 
configuration? 
Looking forward to suggestions. Thanks!



  was:
Hi,

I'm using spark 2.1.0 on AWS EMR and trying to use a UDF in python as follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances 4
spark.executor.cores 8
spark.driver.memory  8G
spark.executor.memory9658M
spark.default.parallelism64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 15 
GiB memory, 160 SSD GB storage

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried increasing {{spark.yarn.executor.memoryOverhead}} to 3000, which 
delays the errors, but they still appear before the end of the job, and the 
job eventually fails.

!Screen Shot 2017-09-06 at 11.31.31.png|width=800!

If I run the above job in Scala, everything works as expected (without having 
to adjust the memoryOverhead):

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}

!Screen Shot 2017-09-06 at 11.31.13.png|width=800!

CPU utilisation is very poor with PySpark.

!Screen Shot 2017-09-06 at 11.30.28.png|width=800!

Is this a known bug with PySpark and UDFs, or is it a matter of bad 
configuration?
Looking forward to suggestions. Thanks!




> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: 

[jira] [Updated] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaos Tsipas updated SPARK-21935:

Description: 
Hi,

I'm using Spark 2.1.0 on AWS EMR and trying to use a UDF in Python as follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
    return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances     4
spark.executor.cores         8
spark.driver.memory          8G
spark.executor.memory        9658M
spark.default.parallelism    64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
15 GiB memory, 160 GB SSD storage.

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried increasing {{spark.yarn.executor.memoryOverhead}} to 3000, which 
delays the errors, but they still appear before the end of the job, and the 
job eventually fails.

!Screen Shot 2017-09-06 at 11.31.31.png|width=800!

If I run the above job in Scala, everything works as expected (without having 
to adjust the memoryOverhead):

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}

!Screen Shot 2017-09-06 at 11.31.13.png|width=800!

CPU utilisation is very poor with PySpark.

!Screen Shot 2017-09-06 at 11.30.28.png|width=800!

Is this a known bug with PySpark and UDFs, or is it a matter of bad 
configuration?
Looking forward to suggestions. Thanks!



  was:
Hi,

I'm using Spark 2.1.0 on AWS EMR and trying to use a UDF in Python as follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
    return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances     4
spark.executor.cores         8
spark.driver.memory          8G
spark.executor.memory        9658M
spark.default.parallelism    64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
15 GiB memory, 160 GB SSD storage.

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried increasing {{spark.yarn.executor.memoryOverhead}} to 3000, which 
delays the errors, but they still appear before the end of the job, and the 
job eventually fails.



If I run the above job in Scala, everything works as expected (without having 
to adjust the memoryOverhead):

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}



CPU utilisation is very poor with PySpark.



Is this a known bug with PySpark and UDFs, or is it a matter of bad 
configuration?
Looking forward to suggestions. Thanks!




> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos 

[jira] [Updated] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaos Tsipas updated SPARK-21935:

Attachment: Screen Shot 2017-09-06 at 11.31.31.png
Screen Shot 2017-09-06 at 11.31.13.png
Screen Shot 2017-09-06 at 11.30.28.png

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: Screen Shot 2017-09-06 at 11.30.28.png, Screen Shot 
> 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using Spark 2.1.0 on AWS EMR and trying to use a UDF in Python as follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
>     return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances     4
> spark.executor.cores         8
> spark.driver.memory          8G
> spark.executor.memory        9658M
> spark.default.parallelism    64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 GB SSD storage.
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried increasing {{spark.yarn.executor.memoryOverhead}} to 3000, which 
> delays the errors, but they still appear before the end of the job, and the 
> job eventually fails.
> 
> If I run the above job in Scala, everything works as expected (without having 
> to adjust the memoryOverhead):
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> 
> CPU utilisation is very poor with PySpark.
> 
> Is this a known bug with PySpark and UDFs, or is it a matter of bad 
> configuration?
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-06 Thread Nikolaos Tsipas (JIRA)
Nikolaos Tsipas created SPARK-21935:
---

 Summary: Pyspark UDF causing ExecutorLostFailure 
 Key: SPARK-21935
 URL: https://issues.apache.org/jira/browse/SPARK-21935
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Nikolaos Tsipas


Hi,

I'm using Spark 2.1.0 on AWS EMR and trying to use a UDF in Python as follows:
{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

path = 's3://some/parquet/dir/myfile.parquet'
df = spark.read.load(path)
def _test_udf(useragent):
    return useragent.upper()

test_udf = udf(_test_udf, StringType())
df = df.withColumn('test_field', test_udf(col('std_useragent')))
df.write.parquet('/output.parquet')
{code}

The following config is used in {{spark-defaults.conf}} (using 
{{maximizeResourceAllocation}} in EMR)
{code}
...
spark.executor.instances     4
spark.executor.cores         8
spark.driver.memory          8G
spark.executor.memory        9658M
spark.default.parallelism    64
spark.driver.maxResultSize   3G
...
{code}
The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
15 GiB memory, 160 GB SSD storage.

The above example fails every single time with errors like the following:
{code}
17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
(executor 10 exited caused by one of the running tasks) Reason: Container 
killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

I tried increasing {{spark.yarn.executor.memoryOverhead}} to 3000, which 
delays the errors, but they still appear before the end of the job, and the 
job eventually fails.



If I run the above job in Scala, everything works as expected (without having 
to adjust the memoryOverhead):

{code}
import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
val upperUDF = udf(upper)
val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
newdf.write.parquet("/output.parquet")
{code}



CPU utilisation is very poor with PySpark.

Is this a known bug with PySpark and UDFs, or is it a matter of bad 
configuration?
Looking forward to suggestions. Thanks!





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id

2017-09-06 Thread Yann Byron (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155174#comment-16155174
 ] 

Yann Byron commented on SPARK-21858:


[~dongjoon]
Thank you for your reply.

SPARK-21055 just makes `grouping__id` usable in Spark, but the behavior is 
different from Hive, i.e. the grouping IDs generated by Spark and Hive are 
different.

I'm not very sure about HIVE-12833. Can we make it the same as Hive, even 
though there is a bug in Hive, as mentioned in HIVE-12833?


> Make Spark grouping_id() compatible with Hive grouping__id
> --
>
> Key: SPARK-21858
> URL: https://issues.apache.org/jira/browse/SPARK-21858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yann Byron
>
> If you want to migrate some ETLs using `grouping__id` in Hive to Spark and 
> use Spark's `grouping_id()` instead of Hive's `grouping__id`, you will find 
> differences between their evaluations.
> Here is an example.
> {code:java}
> select A, B, grouping__id/grouping_id() from t group by A, B grouping 
> sets((), (A), (B), (A,B))
> {code}
> Running it on Hive and Spark separately, you'll find this (an attribute that 
> is selected in the current grouping set is represented by (/), and otherwise 
> by (x)):
> ||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B A||
> |(x) (x)|11|3|0|00|(x) (x)|
> |(x) (/)|10|2|2|10|(/) (x)|
> |(/) (x)|01|1|1|01|(x) (/)|
> |(/) (/)|00|0|3|11|(/) (/)|
> As shown above, in Hive (/) is set to 0 and (x) to 1, while in Spark it's the 
> opposite.
> Moreover, Hive first reverses the attributes in `group by`, while Spark 
> evaluates them directly.
> I suggest modifying the behavior of `grouping_id()` to make it compatible 
> with Hive's `grouping__id`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21903) Upgrade scalastyle to 1.0.0

2017-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155173#comment-16155173
 ] 

Apache Spark commented on SPARK-21903:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19146

> Upgrade scalastyle to 1.0.0
> ---
>
> Key: SPARK-21903
> URL: https://issues.apache.org/jira/browse/SPARK-21903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> 1.0.0 fixes issues with import order, explicit types for public methods, 
> line length limits, and comment validation:
> {code}
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16:
>  Are you sure you want to println? If yes, wrap the code block with
> [error]   // scalastyle:off println
> [error]   println(...)
> [error]   // scalastyle:on println
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49:
>  File line length exceeds 100 characters
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21:
>  Are you sure you want to println? If yes, wrap the code block with
> [error]   // scalastyle:off println
> [error]   println(...)
> [error]   // scalastyle:on println
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2:
>  Insert a space after the start of the comment
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43:
>  JavaDStream should come before JavaDStreamLike.
> {code}
> I am also interested in the {{org.scalastyle.scalariform.OverrideJavaChecker}} 
> feature, added in 0.9.0, which we previously had to work around to check 
> SPARK-16877.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155162#comment-16155162
 ] 

Marco Gaido commented on SPARK-21918:
-

Yes, I think this would be great, thanks.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix this.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.
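
One shape the fix could take, sketched from the description (illustrative 
only, not the actual patch; {{Hive.get(conf)}} is the real Hive API, the 
caching scaffold is hypothetical):

{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

object PerThreadHiveClient {
  // Cache the Hive client per thread instead of process-wide, so each
  // session thread keeps a metastore client tied to its own user.
  private val threadHive = new ThreadLocal[Hive]()

  def client(conf: HiveConf): Hive = {
    val cached = threadHive.get()
    if (cached != null) {
      cached
    } else {
      val c = Hive.get(conf)  // the client bound to this thread's user
      threadHive.set(c)
      c
    }
  }
}
{code}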



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21932:
--
Priority: Minor  (was: Major)

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Priority: Minor
>
> This issue fixes two problems:
>   1. Add two new test cases, for CartesianProduct join and 
> BroadcastNestedLoop join, so that we understand the effect of CodeGen on 
> them; through these two test cases we can see that their Per Row (NS) cost 
> is much higher than that of our previous hash joins.
>   2. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2017-09-06 Thread Hu Liu, (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155148#comment-16155148
 ] 

Hu Liu, commented on SPARK-5159:


I have a patch for the DDL operations, 
https://issues.apache.org/jira/browse/SPARK-21918, and I can merge it together 
if necessary.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Hu Liu, (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155120#comment-16155120
 ] 

Hu Liu, edited comment on SPARK-21918 at 9/6/17 10:03 AM:
--

[~mgaido]
I simply tested the command.
I connected to STS via beeline using a session user different from the user 
who started STS. After running the create table command, I checked the owner 
of the HDFS path, which is the session user.
When I tried to drop a table owned by the user who started STS, I got a 
permission denied exception.

For DML, there is an open issue: 
https://issues.apache.org/jira/browse/SPARK-5159.
I can fix those issues together if necessary.


was (Author: huliu):
[~mgaido]
I simply tested the command.
I connected to STS via beeline using a session user different from the user 
who started STS. After running the create table command, I checked the owner 
of the HDFS path, which is the session user.
When I tried to drop a table owned by the user who started STS, I got a 
permission denied exception.

For DML, there is an open issue: [link 
https://issues.apache.org/jira/browse/SPARK-5159].
I can fix those issues together if necessary.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix this.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Hu Liu, (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155120#comment-16155120
 ] 

Hu Liu, commented on SPARK-21918:
-

[~mgaido]
I simply tested the command.
I connected to STS via beeline using a session user different from the user 
who started STS. After running the create table command, I checked the owner 
of the HDFS path, which is the session user.
When I tried to drop a table owned by the user who started STS, I got a 
permission denied exception.

For DML, there is an open issue: [link 
https://issues.apache.org/jira/browse/SPARK-5159].
I can fix those issues together if necessary.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix this.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21932:


Assignee: Apache Spark

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>
> This issue fixes two problems:
>   1. Add two new test cases, for CartesianProduct join and 
> BroadcastNestedLoop join, so that we understand the effect of CodeGen on 
> them; through these two test cases we can see that their Per Row (NS) cost 
> is much higher than that of our previous hash joins.
>   2. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21932:


Assignee: (was: Apache Spark)

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> This issue fixes two problems:
>   1. Add two new test cases, for CartesianProduct join and 
> BroadcastNestedLoop join, so that we understand the effect of CodeGen on 
> them; through these two test cases we can see that their Per Row (NS) cost 
> is much higher than that of our previous hash joins.
>   2. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21932:
--
Description: 
This issue fixes two problems:
  1. Add two new test cases, for CartesianProduct join and BroadcastNestedLoop 
join, so that we understand the effect of CodeGen on them; through these two 
test cases we can see that their Per Row (NS) cost is much higher than that of 
our previous hash joins.
  2. The reference 'logical.Join' is not right; to be consistent within 
SparkStrategies.scala, remove the package name.

  was:
This issue fixes two problems:
  1. The reference 'logical.Join' is not right; to be consistent within 
SparkStrategies.scala, remove the package name.
  2. Add two test cases, for CartesianProduct join and BroadcastNestedLoop 
join, so that we understand the effect of CodeGen on them.


> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> This issue fixes two problems:
>   1. Add two new test cases, for CartesianProduct join and 
> BroadcastNestedLoop join, so that we understand the effect of CodeGen on 
> them; through these two test cases we can see that their Per Row (NS) cost 
> is much higher than that of our previous hash joins.
>   2. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen reopened SPARK-21932:
---

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> This issue fixes two problems:
>   1. Add two new test cases, for CartesianProduct join and 
> BroadcastNestedLoop join, so that we understand the effect of CodeGen on 
> them; through these two test cases we can see that their Per Row (NS) cost 
> is much higher than that of our previous hash joins.
>   2. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21932:
--
Summary: Add test CartesianProduct join case and BroadcastNestedLoop join 
case in JoinBenchmark  (was: remove package name similar 'logical.Join' to 
'Join' in JoinSelection)

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> This issue fixes two problems:
>   1. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.
>   2. Add two test cases, for CartesianProduct join and BroadcastNestedLoop 
> join, so that we understand the effect of CodeGen on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21932) remove package name similar 'logical.Join' to 'Join' in JoinSelection

2017-09-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21932.
---
Resolution: Not A Problem

> remove package name similar 'logical.Join' to 'Join' in JoinSelection
> -
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> This issue fixes two problems:
>   1. The reference 'logical.Join' is not right; to be consistent within 
> SparkStrategies.scala, remove the package name.
>   2. Add two test cases, for CartesianProduct join and BroadcastNestedLoop 
> join, so that we understand the effect of CodeGen on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155047#comment-16155047
 ] 

Marco Gaido commented on SPARK-21918:
-

What I meant is that if we want to support doAs, we shouldn't just support it 
for DDL operations, but also for all DML & DQL. I am pretty sure your fix 
won't affect the DML & DQL behavior, i.e. with your change we would support 
doAs only for DDL operations. This means there would be a hybrid situation: 
for DDL doAs would work, for DML & DQL it would not. This is not a desirable 
condition.

PS: Out of curiosity, may I ask how you tested that your DDL commands were run 
as the session user?
Thanks.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix this.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Hu Liu, (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155022#comment-16155022
 ] 

Hu Liu, edited comment on SPARK-21918 at 9/6/17 8:51 AM:
-

[~mgaido] Yes, all the jobs are executed using the same user, but the problem 
isn't in STS.
STS opens the session with impersonation when doAs is enabled:
{code:java}
if (cliService.getHiveConf().getBoolVar(ConfVars.HIVE_SERVER2_ENABLE_DOAS) &&
(userName != null)) {
  String delegationTokenStr = getDelegationToken(userName);
  sessionHandle = cliService.openSessionWithImpersonation(protocol, 
userName,
  req.getPassword(), ipAddress, req.getConfiguration(), 
delegationTokenStr);
} else {
{code}
It then runs the SQL under the session UGI in HiveSessionProxy.
For DDL operations, Spark SQL uses the Hive object in HiveClientImpl to 
communicate with the metastore.
Currently the Hive object is shared between different threads, which is why 
all jobs are executed as the same user. In HiveClientImpl:
{code:java}
  private def client: Hive = {
if (clientLoader.cachedHive != null) {
  clientLoader.cachedHive.asInstanceOf[Hive]
} else {
  val c = Hive.get(conf)
  clientLoader.cachedHive = c
  c
}
  }
{code}
Actually, the Hive object stores a different instance for each thread, and the 
class HiveSessionImplwithUGI already creates a Hive object for the current 
user session:

{code:java}
    // create a new metastore connection for this particular user session
Hive.set(null);
try {
  sessionHive = Hive.get(getHiveConf());
} catch (HiveException e) {
  throw new HiveSQLException("Failed to setup metastore connection", e);
}
{code}

If we could pass the Hive object for the current user session to the working 
thread, we could fix this problem.
I have already fixed it and can run DDL operations using the session user.
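
That last step can be sketched with Hive's own thread-local accessors 
({{Hive.get}} and {{Hive.set}} are real Hive APIs; the thread scaffold below 
is only an illustration of the idea, not the actual patch):

{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

val hiveConf = new HiveConf()
// Capture the session's Hive client on the parent (session) thread...
val sessionHive: Hive = Hive.get(hiveConf)

val worker = new Thread(new Runnable {
  override def run(): Unit = {
    // ...and install it in the child thread before running the statement,
    // so metastore calls are made as the session user.
    Hive.set(sessionHive)
    // run the SQL statement here
  }
})
worker.start()
{code}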




was (Author: huliu):
[~mgaido] Yes, all the jobs are executed using the same user, but the problem 
isn't in STS.
STS opens the session with impersonation when doAs is enabled:
{code:java}
if (cliService.getHiveConf().getBoolVar(ConfVars.HIVE_SERVER2_ENABLE_DOAS) &&
(userName != null)) {
  String delegationTokenStr = getDelegationToken(userName);
  sessionHandle = cliService.openSessionWithImpersonation(protocol, 
userName,
  req.getPassword(), ipAddress, req.getConfiguration(), 
delegationTokenStr);
} else {
{code}
It then runs the SQL under the session UGI in HiveSessionProxy.
For DDL operations, Spark SQL uses the Hive object in HiveClientImpl to 
communicate with the metastore.
Currently the Hive object is shared between different threads, which is why 
all jobs are executed as the same user. In HiveClientImpl:
{code:java}
  private def client: Hive = {
if (clientLoader.cachedHive != null) {
  clientLoader.cachedHive.asInstanceOf[Hive]
} else {
  val c = Hive.get(conf)
  clientLoader.cachedHive = c
  c
}
  }
{code}
Actually, the Hive object stores a different instance for each thread, and the 
class HiveSessionImplwithUGI already creates a Hive object for the current 
user session:

{code:java}
    // create a new metastore connection for this particular user session
Hive.set(null);
try {
  sessionHive = Hive.get(getHiveConf());
} catch (HiveException e) {
  throw new HiveSQLException("Failed to setup metastore connection", e);
}
{code}

If we could pass the Hive object for the current user session to the working 
thread, we could fix this problem.
I have already fixed it and can run DDL operations using the session user.



> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix this.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Hu Liu, (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155022#comment-16155022
 ] 

Hu Liu, commented on SPARK-21918:
-

[~mgaido] Yes, all the jobs are executed using the same user, but the problem 
isn't in STS.
STS opens the session with impersonation when doAs is enabled:
{code:java}
if (cliService.getHiveConf().getBoolVar(ConfVars.HIVE_SERVER2_ENABLE_DOAS) &&
(userName != null)) {
  String delegationTokenStr = getDelegationToken(userName);
  sessionHandle = cliService.openSessionWithImpersonation(protocol, 
userName,
  req.getPassword(), ipAddress, req.getConfiguration(), 
delegationTokenStr);
} else {
{code}
It then runs the SQL under the session UGI in HiveSessionProxy.
For DDL operations, Spark SQL uses the Hive object in HiveClientImpl to 
communicate with the metastore.
Currently the Hive object is shared between different threads, which is why 
all jobs are executed as the same user. In HiveClientImpl:
{code:java}
  private def client: Hive = {
if (clientLoader.cachedHive != null) {
  clientLoader.cachedHive.asInstanceOf[Hive]
} else {
  val c = Hive.get(conf)
  clientLoader.cachedHive = c
  c
}
  }
{code}
Actually, the Hive object stores a different instance for each thread, and the 
class HiveSessionImplwithUGI already creates a Hive object for the current 
user session:

{code:java}
    // create a new metastore connection for this particular user session
Hive.set(null);
try {
  sessionHive = Hive.get(getHiveConf());
} catch (HiveException e) {
  throw new HiveSQLException("Failed to setup metastore connection", e);
}
{code}

If we could pass the Hive object for the current user session to the working 
thread, we could fix this problem.
I have already fixed it and can run DDL operations using the session user.



> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix this.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21927) scalastyle 1.0.0 generates SBT warnings

2017-09-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21927:
--
Priority: Minor  (was: Major)
 Summary: scalastyle 1.0.0 generates SBT warnings  (was: Spark pom.xml's 
dependency management is broken)

Fixing the warning is fine, but I think the resolution is getting the 
dependency version that the build needs.

CC [~hyukjin.kwon] because I think it is related to scalastyle 1.0.0. I think 
we'd generally suppress this warning because Maven doesn't make a big deal out 
of it. Here it's complaining because the plugins themselves might be using 
incompatible libraries. I don't really want to get into managing their 
dependency graph, and I'm not sure we can manually resolve this anyway.

> scalastyle 1.0.0 generates SBT warnings
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>Priority: Minor
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. It looks like the dependency 
> management in the POMs is somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> 

[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-09-06 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155018#comment-16155018
 ] 

Peng Meng commented on SPARK-21476:
---

When there are 50 instances per partition, this broadcast is a little faster 
than the current solution; when there are 500 instances per partition, it is a 
little slower.
In both cases the difference is about 10%.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>Priority: Minor
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and which is 
> actually used in transform, is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9104) expose network layer memory usage

2017-09-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-9104:
---
Summary: expose network layer memory usage  (was: expose network layer 
memory usage in shuffle part)

> expose network layer memory usage
> -
>
> Key: SPARK-9104
> URL: https://issues.apache.org/jira/browse/SPARK-9104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Zhang, Liye
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> The default network transport is Netty, and when transferring blocks for 
> shuffle the network layer consumes a significant amount of memory. We should 
> collect the memory usage of this part and expose it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-21934:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-9103

> Expose Netty memory usage via Metrics System
> 
>
> Key: SPARK-21934
> URL: https://issues.apache.org/jira/browse/SPARK-21934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>
> This is a follow-up to SPARK-9104 to expose the Netty memory usage to the 
> MetricsSystem. My initial thought is to expose only shuffle memory usage, 
> since shuffle is a major part of network-communication memory usage compared 
> to RPC, the file server, and block transfer.
> If users want to expose Netty memory usage for other modules as well, we 
> could add more metrics later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-06 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-21934:
---

 Summary: Expose Netty memory usage via Metrics System
 Key: SPARK-21934
 URL: https://issues.apache.org/jira/browse/SPARK-21934
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Saisai Shao


This is a follow-up to SPARK-9104 to expose the Netty memory usage to the 
MetricsSystem. My initial thought is to expose only shuffle memory usage, 
since shuffle is a major part of network-communication memory usage compared 
to RPC, the file server, and block transfer.

If users want to expose Netty memory usage for other modules as well, we could 
add more metrics later.
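
A sketch of what one such gauge could look like (illustrative; the metric name 
and wiring are assumptions, while Spark's MetricsSystem is backed by Dropwizard 
metrics and {{PooledByteBufAllocator.metric()}} is the Netty 4.1 API):

{code}
import com.codahale.metrics.{Gauge, MetricRegistry}
import io.netty.buffer.PooledByteBufAllocator

val registry = new MetricRegistry()

// Direct memory held by Netty's default pooled allocator, which backs
// shuffle block transfers.
registry.register("netty.shuffle.usedDirectMemory", new Gauge[Long] {
  override def getValue: Long =
    PooledByteBufAllocator.DEFAULT.metric().usedDirectMemory()
})
{code}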



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21933) Spark Streaming requests more executors than expected without DynamicAllocation

2017-09-06 Thread Congxian Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154998#comment-16154998
 ] 

Congxian Qiu commented on SPARK-21933:
--

Created a PR for this issue: https://github.com/apache/spark/pull/19145

> Spark Streaming requests more executors than expected without 
> DynamicAllocation
>
> Key: SPARK-21933
> URL: https://issues.apache.org/jira/browse/SPARK-21933
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Congxian Qiu
>
> When a Spark Streaming application runs on YARN without DynamicAllocation 
> and a NodeManager is lost, the containers on the lost NodeManager are 
> reported to the ApplicationMaster, which allocates new containers to replace 
> them.
> But suppose the lost NodeManager becomes available again and the 
> ResourceManager is later restarted. Because of YARN's HA, the NodeManager 
> then re-reports to the ResourceManager all the containers that were 
> previously running on it, so the ApplicationMaster receives duplicated 
> completed-container messages and the Spark Streaming application requests 
> more resources than it requires.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21933) Spark Streaming requests more executors than expected without DynamicAllocation

2017-09-06 Thread Congxian Qiu (JIRA)
Congxian Qiu created SPARK-21933:


 Summary: Spark Streaming requests more executors than expected 
without DynamicAllocation
 Key: SPARK-21933
 URL: https://issues.apache.org/jira/browse/SPARK-21933
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Congxian Qiu


When a Spark Streaming application runs on YARN without DynamicAllocation and 
a NodeManager is lost, the containers on the lost NodeManager are reported to 
the ApplicationMaster, which allocates new containers to replace them.

But suppose the lost NodeManager becomes available again and the 
ResourceManager is later restarted. Because of YARN's HA, the NodeManager then 
re-reports to the ResourceManager all the containers that were previously 
running on it, so the ApplicationMaster receives duplicated 
completed-container messages and the Spark Streaming application requests more 
resources than it requires.
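
The fix implied by this description is essentially de-duplication of 
completed-container reports. A minimal sketch of the idea (illustrative only, 
not the actual patch):

{code}
import scala.collection.mutable

object CompletedContainerTracker {
  private val completed = mutable.HashSet[String]()

  // True only the first time a container id is reported complete; a report
  // replayed after a ResourceManager restart returns false, so no extra
  // replacement executor is requested for it.
  def shouldRequestReplacement(containerId: String): Boolean =
    synchronized { completed.add(containerId) }
}
{code}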



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20201) Flaky Test: org.apache.spark.sql.catalyst.expressions.OrderingSuite

2017-09-06 Thread Armin Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154975#comment-16154975
 ] 

Armin Braun commented on SPARK-20201:
-

This was resolved by this commit in June 
https://github.com/original-brownbear/spark/commit/b32b2123ddca66e00acf4c9d956232e07f779f9f#diff-4fe0e85423909b24c2a56287468271f1R138

> Flaky Test: org.apache.spark.sql.catalyst.expressions.OrderingSuite
> ---
>
> Key: SPARK-20201
> URL: https://issues.apache.org/jira/browse/SPARK-20201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2856/testReport/junit/org.apache.spark.sql.catalyst.expressions/OrderingSuite/SPARK_16845__GeneratedClass$SpecificOrdering_grows_beyond_64_KB/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.catalyst.expressions.OrderingSuite_name=SPARK-16845%3A+GeneratedClass%24SpecificOrdering+grows+beyond+64+KB
> Error Message
> {code}
> java.lang.StackOverflowError
> {code}
> {code}
> com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:903)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:887)
>   at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
>   at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
>   at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
>   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
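For context, a sketch of the kind of stress this test applies (hedged: the 
suite's exact code may differ, and this uses Spark's internal Catalyst API):

{code}
// Sketch of the OrderingSuite stress case (the suite's exact code may
// differ): generate a codegen'd ordering over thousands of sort keys, which
// historically pushed the generated compare() method past the JVM's 64 KB
// bytecode-per-method limit and, at one point, overflowed the stack during
// code generation.
import org.apache.spark.sql.catalyst.expressions.{Ascending, Literal, SortOrder}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering

object WideOrderingStress {
  def main(args: Array[String]): Unit = {
    val sortOrder = SortOrder(Literal("abc"), Ascending)
    // Before the fix this failed to compile the generated class or crashed
    // with a StackOverflowError; afterwards it succeeds.
    val ordering = GenerateOrdering.generate(Array.fill(5000)(sortOrder))
    println(s"generated: ${ordering.getClass.getName}")
  }
}
{code}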

[jira] [Closed] (SPARK-21783) Turn on ORC filter push-down by default

2017-09-06 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-21783.
-
Resolution: Later

This may be out of scope for Apache Spark 2.3.0, since that release marks the 
debut of Apache ORC 1.4.0. I'll close this for now. Thank you all for the 
advice on this PR.

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option has been turned off by default from the beginning (SPARK-2883):
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149
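
In the meantime, the option can still be enabled explicitly. A minimal sketch, 
assuming a local session and a placeholder ORC path:

{code}
// Enabling ORC predicate push-down explicitly (the setting this ticket
// proposed to flip on by default). The key spark.sql.orc.filterPushdown
// dates back to SPARK-2883 and defaults to false in Spark 2.x.
// "/path/to/data.orc" is a placeholder path.
import org.apache.spark.sql.SparkSession

object OrcPushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-filter-pushdown")
      .master("local[*]")
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()

    // With push-down on, this filter can be evaluated inside the ORC reader
    // instead of after a full scan.
    spark.read.orc("/path/to/data.orc").where("id > 100").show()

    spark.stop()
  }
}
{code}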



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


