[jira] [Commented] (SPARK-19148) do not expose the external table concept in Catalog

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952057#comment-15952057
 ] 

Apache Spark commented on SPARK-19148:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17502

> do not expose the external table concept in Catalog
> ---
>
> Key: SPARK-19148
> URL: https://issues.apache.org/jira/browse/SPARK-19148
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>






[jira] [Comment Edited] (SPARK-9478) Add sample weights to Random Forest

2017-03-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951685#comment-15951685
 ] 

Joseph K. Bradley edited comment on SPARK-9478 at 4/1/17 2:35 AM:
--

[~clamus] The current vote is to *not use* weights during sampling and then to 
*use* weights when growing the trees.  That will simplify the sampling process 
so we hopefully won't have to deal with the complexity you're mentioning.  Note 
that we'll have to weight the trees in the forest to make this approach work.

I'm also guessing that it will give better calibrated probability estimates in 
the final forest, though this is based on intuition rather than analysis.  
E.g., given the 4-instance dataset in [~sethah]'s example above, with 
subsampling 4 instances for each tree, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees.  We pick 10 sets of 4 rows.  The probability of always 
picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... 
(current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5.  This means 
we'll have at least one tree with the weight-1000 row, so it will dominate our 
predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.02.  
This means it's pretty likely we'll have some tree predicting label1, so we'll 
keep our probability predictions away from 0 and 1.

This is really hand-wavy, but it does alleviate my fears of having extreme log 
losses.  On the other hand, maybe it could be handled by adding smoothing to 
predictions...
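
A rough numerical check of the first two figures, assuming one weight-1000 row, 
three weight-1 rows, and a bootstrap sample of 4 rows (with replacement) for 
each of the 10 trees:

```
# Sketch: verify the quoted probabilities under the assumptions stated above.
total_draws = 10 * 4  # 10 trees x 4 sampled rows per tree

# Weighted sampling: probability that every draw is the weight-1000 row.
print((1000 / 1003) ** total_draws)   # ~0.89

# Unweighted sampling: probability that every draw is a weight-1 row,
# i.e. no tree ever sees the weight-1000 row.
print((3 / 4) ** total_draws)         # ~1e-5
```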


was (Author: josephkb):
[~clamus] The current vote is to *not use* weights during sampling and then to 
*use* weights when growing the trees.  That will simplify the sampling process 
so we hopefully won't have to deal with the complexity you're mentioning.  Note 
that we'll have to weight the trees in the forest to make this approach work.

I'm also guessing that it will give better calibrated probability estimates in 
the final forest, though this is based on intuition rather than analysis.  
E.g., given the 4-instance dataset in [~sethah]'s example above, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees.  We pick 10 sets of 4 rows.  The probability of always 
picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... 
(current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5.  This means 
we'll have at least one tree with the weight-1000 row, so it will dominate our 
predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.02.  
This means it's pretty likely we'll have some tree predicting label1, so we'll 
keep our probability predictions away from 0 and 1.

This is really hand-wavy, but it does alleviate my fears of having extreme log 
losses.  On the other hand, maybe it could be handled by adding smoothing to 
predictions...

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.





[jira] [Assigned] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20183:


Assignee: Seth Hendrickson  (was: Apache Spark)

> Add outlierRatio option to testOutliersWithSmallWeights
> ---
>
> Key: SPARK-20183
> URL: https://issues.apache.org/jira/browse/SPARK-20183
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Tests
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>
> Part 1 of parent PR: Add flexibility to testOutliersWithSmallWeights test.  
> See https://github.com/apache/spark/pull/16722 for perspective.





[jira] [Commented] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951865#comment-15951865
 ] 

Apache Spark commented on SPARK-20183:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/17501

> Add outlierRatio option to testOutliersWithSmallWeights
> ---
>
> Key: SPARK-20183
> URL: https://issues.apache.org/jira/browse/SPARK-20183
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Tests
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>
> Part 1 of parent PR: Add flexibility to testOutliersWithSmallWeights test.  
> See https://github.com/apache/spark/pull/16722 for perspective.





[jira] [Assigned] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20183:


Assignee: Apache Spark  (was: Seth Hendrickson)

> Add outlierRatio option to testOutliersWithSmallWeights
> ---
>
> Key: SPARK-20183
> URL: https://issues.apache.org/jira/browse/SPARK-20183
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Tests
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Part 1 of parent PR: Add flexibility to testOutliersWithSmallWeights test.  
> See https://github.com/apache/spark/pull/16722 for perspective.





[jira] [Created] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights

2017-03-31 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20183:
-

 Summary: Add outlierRatio option to testOutliersWithSmallWeights
 Key: SPARK-20183
 URL: https://issues.apache.org/jira/browse/SPARK-20183
 Project: Spark
  Issue Type: Sub-task
  Components: ML, Tests
Affects Versions: 2.1.0
Reporter: Joseph K. Bradley
Assignee: Seth Hendrickson


Part 1 of parent PR: Add flexibility to testOutliersWithSmallWeights test.  See 
https://github.com/apache/spark/pull/16722 for perspective.





[jira] [Updated] (SPARK-19591) Add sample weights to decision trees

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19591:
--
Description: Add sample weights to decision trees.  See [SPARK-9478] for 
details on the design.  (was: Add sample weights to decision trees)

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Add sample weights to decision trees.  See [SPARK-9478] for details on the 
> design.





[jira] [Updated] (SPARK-19591) Add sample weights to decision trees

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19591:
--
Issue Type: New Feature  (was: Sub-task)
Parent: (was: SPARK-9478)

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Add sample weights to decision trees





[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table

2017-03-31 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-19408:
---
Description: 
In SPARK-17075, we estimate the cardinality of a predicate expression "column 
(op) literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
predicate expressions involving two columns, such as "column-1 (op) column-2", 
where column-1 and column-2 belong to the same table.  Note that if column-1 and 
column-2 belong to different tables, then it is a join operator's work, NOT a 
filter operator's work.

In this JIRA, we want to estimate the filter factor of predicate expressions 
involving two columns of the same table.  For example, multiple TPC-H queries 
have this kind of predicate: "WHERE l_commitdate < l_receiptdate".
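
One simple way to illustrate such a filter factor estimate is to compare the two 
columns' min/max statistics and assume roughly uniform, independent values. The 
sketch below is only an illustration of the general idea, not necessarily the 
approach taken for this issue:

```
import random

def less_than_selectivity(min1, max1, min2, max2, trials=100_000):
    """Rough estimate of the filter factor of "col1 < col2".

    Illustrative only: assumes independent, uniformly distributed columns and
    uses simulation, whereas a real optimizer would work from column statistics
    (min/max, histograms) in closed form.
    """
    if max1 < min2:
        return 1.0   # ranges disjoint: col1 is always smaller
    if min1 >= max2:
        return 0.0   # ranges disjoint: col1 is never smaller
    hits = sum(random.uniform(min1, max1) < random.uniform(min2, max2)
               for _ in range(trials))
    return hits / trials

# e.g. l_commitdate spanning days [1, 90], l_receiptdate spanning days [30, 120]
print(less_than_selectivity(1, 90, 30, 120))
```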

  was:
In SPARK-17075, we estimate the cardinality of a predicate expression "column 
(op) literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
predicate expressions involving two columns, such as "column-1 (op) column-2", 
where column-1 and column-2 belong to the same table.  Note that if column-1 and 
column-2 belong to different tables, then it is a join operator's work, NOT a 
filter operator's work.

In this JIRA, we want to estimate the filter factor of predicate expressions 
involving two columns of the same table.


> cardinality estimation involving two columns of the same table
> --
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>
> In SPARK-17075, we estimate the cardinality of a predicate expression "column 
> (op) literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
> predicate expressions involving two columns, such as "column-1 (op) column-2", 
> where column-1 and column-2 belong to the same table.  Note that if column-1 
> and column-2 belong to different tables, then it is a join operator's work, 
> NOT a filter operator's work.
> In this JIRA, we want to estimate the filter factor of predicate expressions 
> involving two columns of the same table.  For example, multiple TPC-H queries 
> have this kind of predicate: "WHERE l_commitdate < l_receiptdate".





[jira] [Updated] (SPARK-20003) FPGrowthModel setMinConfidence should affect rules generation and transform

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20003:
--
Target Version/s: 2.2.0

> FPGrowthModel setMinConfidence should affect rules generation and transform
> ---
>
> Key: SPARK-20003
> URL: https://issues.apache.org/jira/browse/SPARK-20003
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> I was doing some tests and found this issue: FPGrowthModel setMinConfidence 
> should affect rules generation and transform. 
> Currently, associationRules in FPGrowthModel is a lazy val, so 
> setMinConfidence in FPGrowthModel has no impact once associationRules has 
> been computed.
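
For context on why the lazy val matters: a lazy val is computed once on first 
access and then cached, so a parameter changed afterwards is never re-read. A 
rough Python analogy of the reported behaviour, with purely hypothetical class 
and method names (not Spark's actual code):

```
# The rule list is cached on first access, so a later set_min_confidence call
# has no effect on it.
class ToyFPGrowthModel:
    def __init__(self, min_confidence=0.8):
        self.min_confidence = min_confidence
        self._rules = None                      # plays the role of the lazy val

    def set_min_confidence(self, value):
        self.min_confidence = value
        return self

    @property
    def association_rules(self):
        if self._rules is None:                 # computed once, then cached
            self._rules = [r for r in self._candidate_rules()
                           if r["confidence"] >= self.min_confidence]
        return self._rules

    def _candidate_rules(self):
        # Stand-in for real rule generation from frequent itemsets.
        return [{"rule": "a -> b", "confidence": 0.9},
                {"rule": "b -> c", "confidence": 0.5}]

model = ToyFPGrowthModel()
print(len(model.association_rules))                           # 1 (conf >= 0.8)
print(len(model.set_min_confidence(0.1).association_rules))   # still 1: cached
```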





[jira] [Updated] (SPARK-20003) FPGrowthModel setMinConfidence should affect rules generation and transform

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20003:
--
Shepherd: Joseph K. Bradley

> FPGrowthModel setMinConfidence should affect rules generation and transform
> ---
>
> Key: SPARK-20003
> URL: https://issues.apache.org/jira/browse/SPARK-20003
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> I was doing some tests and found this issue: FPGrowthModel setMinConfidence 
> should affect rules generation and transform. 
> Currently, associationRules in FPGrowthModel is a lazy val, so 
> setMinConfidence in FPGrowthModel has no impact once associationRules has 
> been computed.





[jira] [Assigned] (SPARK-20003) FPGrowthModel setMinConfidence should affect rules generation and transform

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20003:
-

Assignee: yuhao yang

> FPGrowthModel setMinConfidence should affect rules generation and transform
> ---
>
> Key: SPARK-20003
> URL: https://issues.apache.org/jira/browse/SPARK-20003
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> I was doing some tests and found this issue: FPGrowthModel setMinConfidence 
> should affect rules generation and transform. 
> Currently, associationRules in FPGrowthModel is a lazy val, so 
> setMinConfidence in FPGrowthModel has no impact once associationRules has 
> been computed.





[jira] [Created] (SPARK-20182) Dot in DataFrame Column title causes errors

2017-03-31 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-20182:
--

 Summary: Dot in DataFrame Column title causes errors
 Key: SPARK-20182
 URL: https://issues.apache.org/jira/browse/SPARK-20182
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Evan Zamir


I did a search and saw this issue pop up before, and while it seemed like it 
had been solved before 2.1, I'm still seeing an error.

```
emp = spark.createDataFrame([(["Joe", "Bob", "Mary"],),
                             (["Mike", "Matt", "Stacy"],)],
                            ["first.names"])

print(emp.collect())

emp.select(['first.names']).alias('first')
```
[Row(first.names=['Joe', 'Bob', 'Mary']), Row(first.names=['Mike', 'Matt', 'Stacy'])]
Py4JJavaError Traceback (most recent call last)
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:

Py4JJavaError: An error occurred while calling o1734.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`first.names`' given 
input columns: [first.names];;
'Project ['first.names]
+- LogicalRDD [first.names#466]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at 
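
The traceback above is cut off in the archive. A possible workaround (untested 
against this exact report, and reusing the emp DataFrame from the snippet above) 
is to backtick-quote the dotted name so it is not parsed as a struct-field 
access, or to rename the column:

```
# Backticks make "first.names" resolve as a single top-level column instead of
# a struct field access.
emp.select('`first.names`').show()

# Alternatively, rename the column to avoid the dot altogether.
emp.withColumnRenamed('first.names', 'first_names').select('first_names').show()
```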

[jira] [Updated] (SPARK-20164) AnalysisException not tolerant of null query plan

2017-03-31 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20164:
-
Description: 
The query plan in an AnalysisException may be null when an AnalysisException 
object is serialized and then deserialized, since plan is marked @transient. Or 
when someone throws an AnalysisException with a null query plan (which should 
not happen).
def getMessage is not tolerant of this and throws a NullPointerException, 
leading to loss of information about the original exception.
The fix is to add a null check in getMessage.
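
A minimal sketch of the described fix, in Python for illustration only (toy 
class, not Spark's actual AnalysisException): guard the message construction 
when the plan is missing.

```
# Tolerate a missing plan when building the message, instead of failing and
# hiding the original error.
class ToyAnalysisException(Exception):
    def __init__(self, message, plan=None):
        super().__init__(message)
        self.message = message
        self.plan = plan              # may be lost, e.g. after deserialization

    def get_message(self):
        if self.plan is None:         # the added null check
            return self.message
        return f"{self.message};\n{self.plan}"

print(ToyAnalysisException("cannot resolve 'x'").get_message())
```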

  was:
The query plan in an `AnalysisException` may be `null` when an 
`AnalysisException` object is serialized and then deserialized, since `plan` is 
marked `@transient`. Or when someone throws an `AnalysisException` with a null 
query plan (which should not happen).
`def getMessage` is not tolerant of this and throws a `NullPointerException`, 
leading to loss of information about the original exception.
The fix is to add a `null` check in `getMessage`.


> AnalysisException not tolerant of null query plan
> -
>
> Key: SPARK-20164
> URL: https://issues.apache.org/jira/browse/SPARK-20164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.2.0, 2.1.2
>
>
> The query plan in an AnalysisException may be null when an AnalysisException 
> object is serialized and then deserialized, since plan is marked @transient. 
> Or when someone throws an AnalysisException with a null query plan (which 
> should not happen).
> def getMessage is not tolerant of this and throws a NullPointerException, 
> leading to loss of information about the original exception.
> The fix is to add a null check in getMessage.





[jira] [Updated] (SPARK-20164) AnalysisException not tolerant of null query plan

2017-03-31 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20164:
-
Description: 
The query plan in an `AnalysisException` may be `null` when an 
`AnalysisException` object is serialized and then deserialized, since `plan` is 
marked `@transient`. Or when someone throws an `AnalysisException` with a null 
query plan (which should not happen).
`def getMessage` is not tolerant of this and throws a `NullPointerException`, 
leading to loss of information about the original exception.
The fix is to add a `null` check in `getMessage`.

  was:When someone throws an AnalysisException with a null query plan (which 
ideally no one should), getMessage is not tolerant of this and throws a null 
pointer exception, leading to loss of information about the original exception.


> AnalysisException not tolerant of null query plan
> -
>
> Key: SPARK-20164
> URL: https://issues.apache.org/jira/browse/SPARK-20164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.2.0, 2.1.2
>
>
> The query plan in an `AnalysisException` may be `null` when an 
> `AnalysisException` object is serialized and then deserialized, since `plan` 
> is marked `@transient`. Or when someone throws an `AnalysisException` with a 
> null query plan (which should not happen).
> `def getMessage` is not tolerant of this and throws a `NullPointerException`, 
> leading to loss of information about the original exception.
> The fix is to add a `null` check in `getMessage`.





[jira] [Closed] (SPARK-20163) Kill all running tasks in a stage in case of fetch failure

2017-03-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia closed SPARK-20163.
---
Resolution: Duplicate

> Kill all running tasks in a stage in case of fetch failure
> --
>
> Key: SPARK-20163
> URL: https://issues.apache.org/jira/browse/SPARK-20163
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently, the scheduler does not kill the running tasks in a stage when it 
> encounters a fetch failure; as a result, we might end up running many 
> duplicate tasks in the cluster. There is already a TODO in TaskSetManager to 
> kill all running tasks, but it has not been implemented yet.





[jira] [Commented] (SPARK-20163) Kill all running tasks in a stage in case of fetch failure

2017-03-31 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951738#comment-15951738
 ] 

Sital Kedia commented on SPARK-20163:
-

Thanks [~imranr], closing this as a duplicate of SPARK-2666.

> Kill all running tasks in a stage in case of fetch failure
> --
>
> Key: SPARK-20163
> URL: https://issues.apache.org/jira/browse/SPARK-20163
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently, the scheduler does not kill the running tasks in a stage when it 
> encounters a fetch failure; as a result, we might end up running many 
> duplicate tasks in the cluster. There is already a TODO in TaskSetManager to 
> kill all running tasks, but it has not been implemented yet.





[jira] [Assigned] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20181:


Assignee: (was: Apache Spark)

> Avoid noisy Jetty WARN log when failing to bind a port
> --
>
> Key: SPARK-20181
> URL: https://issues.apache.org/jira/browse/SPARK-20181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Derek Dagit
>Priority: Minor
>
> As a user, I would like to suppress the Jetty WARN log about failing to bind 
> to a port already in use, so that my logs are less noisy.
> Currently, Jetty code prints the stack trace of the BindException at WARN 
> level. In the context of starting a service on an ephemeral port, this is not 
> a useful warning, and it is exceedingly verbose.
> {noformat}
> 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED 
> ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: 
> Address already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
>   at 
> org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>   at 
> org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.spark_project.jetty.server.Server.doStart(Server.java:366)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
>   at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext.(SparkContext.scala:448)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   at $line3.$read$$iw$$iw.(:15)
>   at $line3.$read$$iw.(:31)
>   at $line3.$read.(:33)
>   at $line3.$read$.(:37)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
>  

[jira] [Assigned] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20181:


Assignee: Apache Spark

> Avoid noisy Jetty WARN log when failing to bind a port
> --
>
> Key: SPARK-20181
> URL: https://issues.apache.org/jira/browse/SPARK-20181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Derek Dagit
>Assignee: Apache Spark
>Priority: Minor
>
> As a user, I would like to suppress the Jetty WARN log about failing to bind 
> to a port already in use, so that my logs are less noisy.
> Currently, Jetty code prints the stack trace of the BindException at WARN 
> level. In the context of starting a service on an ephemeral port, this is not 
> a useful warning, and it is exceedingly verbose.
> {noformat}
> 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED 
> ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: 
> Address already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
>   at 
> org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>   at 
> org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.spark_project.jetty.server.Server.doStart(Server.java:366)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
>   at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext.(SparkContext.scala:448)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   at $line3.$read$$iw$$iw.(:15)
>   at $line3.$read$$iw.(:31)
>   at $line3.$read.(:33)
>   at $line3.$read$.(:37)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at 

[jira] [Commented] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951699#comment-15951699
 ] 

Apache Spark commented on SPARK-20181:
--

User 'd2r' has created a pull request for this issue:
https://github.com/apache/spark/pull/17500

> Avoid noisy Jetty WARN log when failing to bind a port
> --
>
> Key: SPARK-20181
> URL: https://issues.apache.org/jira/browse/SPARK-20181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Derek Dagit
>Priority: Minor
>
> As a user, I would like to suppress the Jetty WARN log about failing to bind 
> to a port already in use, so that my logs are less noisy.
> Currently, Jetty code prints the stack trace of the BindException at WARN 
> level. In the context of starting a service on an ephemeral port, this is not 
> a useful warning, and it is exceedingly verbose.
> {noformat}
> 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED 
> ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: 
> Address already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
>   at 
> org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>   at 
> org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.spark_project.jetty.server.Server.doStart(Server.java:366)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
>   at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext.(SparkContext.scala:448)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   at $line3.$read$$iw$$iw.(:15)
>   at $line3.$read$$iw.(:31)
>   at $line3.$read.(:33)
>   at $line3.$read$.(:37)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at 

[jira] [Commented] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port

2017-03-31 Thread Derek Dagit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951686#comment-15951686
 ] 

Derek Dagit commented on SPARK-20181:
-

Working on this...

> Avoid noisy Jetty WARN log when failing to bind a port
> --
>
> Key: SPARK-20181
> URL: https://issues.apache.org/jira/browse/SPARK-20181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Derek Dagit
>Priority: Minor
>
> As a user, I would like to suppress the Jetty WARN log about failing to bind 
> to a port already in use, so that my logs are less noisy.
> Currently, Jetty code prints the stack trace of the BindException at WARN 
> level. In the context of starting a service on an ephemeral port, this is not 
> a useful warning, and it is exceedingly verbose.
> {noformat}
> 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED 
> ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: 
> Address already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
>   at 
> org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>   at 
> org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.spark_project.jetty.server.Server.doStart(Server.java:366)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
>   at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext.(SparkContext.scala:448)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   at $line3.$read$$iw$$iw.(:15)
>   at $line3.$read$$iw.(:31)
>   at $line3.$read.(:33)
>   at $line3.$read$.(:37)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at 

[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-03-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951685#comment-15951685
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

[~clamus] The current vote is to *not use* weights during sampling and then to 
*use* weights when growing the trees.  That will simplify the sampling process 
so we hopefully won't have to deal with the complexity you're mentioning.  Note 
that we'll have to weight the trees in the forest to make this approach work.

I'm also guessing that it will give better calibrated probability estimates in 
the final forest, though this is based on intuition rather than analysis.  
E.g., given the 4-instance dataset in [~sethah]'s example above, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees.  We pick 10 sets of 4 rows.  The probability of always 
picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... 
(current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5.  This means 
we'll have at least one tree with the weight-1000 row, so it will dominate our 
predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.02.  
This means it's pretty likely we'll have some tree predicting label1, so we'll 
keep our probability predictions away from 0 and 1.

This is really hand-wavy, but it does alleviate my fears of having extreme log 
losses.  On the other hand, maybe it could be handled by adding smoothing to 
predictions...

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.





[jira] [Created] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port

2017-03-31 Thread Derek Dagit (JIRA)
Derek Dagit created SPARK-20181:
---

 Summary: Avoid noisy Jetty WARN log when failing to bind a port
 Key: SPARK-20181
 URL: https://issues.apache.org/jira/browse/SPARK-20181
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Derek Dagit
Priority: Minor


As a user, I would like to suppress the Jetty WARN log about failing to bind to 
a port already in use, so that my logs are less noisy.

Currently, Jetty code prints the stack trace of the BindException at WARN 
level. In the context of starting a service on an ephemeral port, this is not a 
useful warning, and it is exceedingly verbose.

{noformat}
17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED 
ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: 
Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at 
org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
at 
org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
at 
org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
at 
org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.spark_project.jetty.server.Server.doStart(Server.java:366)
at 
org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
at 
org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
at 
org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext.(SparkContext.scala:448)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
at $line3.$read$$iw$$iw.(:15)
at $line3.$read$$iw.(:31)
at $line3.$read.(:33)
at $line3.$read$.(:37)
at $line3.$read$.()
at $line3.$eval$.$print$lzycompute(:7)
at $line3.$eval$.$print(:6)
at $line3.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at 
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at 
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at 
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at 

[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9478:
-
Description: Currently, this implementation of random forest does not 
support sample (instance) weights. Weights are important when there is 
imbalanced training data or the evaluation metric of a classifier is imbalanced 
(e.g. true positive rate at some false positive threshold).  Sample weights 
generalize class weights, so this could be used to add class weights later on.  
(was: Currently, this implementation of random forest does not support class 
weights. Class weights are important when there is imbalanced training data or 
the evaluation metric of a classifier is imbalanced (e.g. true positive rate at 
some false positive threshold). )
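
A one-liner illustrating the last point, that class weights are a special case 
of sample weights (hypothetical weight values, for illustration only):

```
# Class weights reduce to sample weights: give each instance the weight of its label.
class_weight = {0: 1.0, 1: 10.0}                      # hypothetical per-class weights
labels = [0, 1, 1, 0]
sample_weight = [class_weight[y] for y in labels]     # [1.0, 10.0, 10.0, 1.0]
```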

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.





[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9478:
-
Shepherd: Joseph K. Bradley

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.





[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest

2017-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9478:
-
Shepherd:   (was: Joseph K. Bradley)

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern

2017-03-31 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951656#comment-15951656
 ] 

Sahil Takiar commented on SPARK-20161:
--

[~xuefuz] could you comment on https://github.com/apache/spark/pull/17499 and 
maybe provide some more context as to how this will benefit HoS?

> Default log4j properties file should print thread-id in ConversionPattern
> -
>
> Key: SPARK-20161
> URL: https://issues.apache.org/jira/browse/SPARK-20161
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, YARN
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>
> The default log4j file in {{spark/conf/log4j.properties.template}} doesn't 
> display the thread-id when printing out the logs. It would be very useful to 
> add this, especially for YARN. Currently, logs from all the different threads 
> in a single executor are sent to the same log file. This makes debugging 
> difficult as it is hard to filter out what logs come from what thread.
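For reference, a minimal sketch of the kind of change being discussed, assuming the log4j 1.x PatternLayout syntax used by the template, where %t prints the thread name; the exact pattern adopted in the pull request may differ:

# conf/log4j.properties (sketch) -- add %t so every line carries its thread name
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %t %p %c{1}: %m%n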



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20161:


Assignee: (was: Apache Spark)

> Default log4j properties file should print thread-id in ConversionPattern
> -
>
> Key: SPARK-20161
> URL: https://issues.apache.org/jira/browse/SPARK-20161
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, YARN
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>
> The default log4j file in {{spark/conf/log4j.properties.template}} doesn't 
> display the thread-id when printing out the logs. It would be very useful to 
> add this, especially for YARN. Currently, logs from all the different threads 
> in a single executor are sent to the same log file. This makes debugging 
> difficult as it is hard to filter out what logs come from what thread.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951595#comment-15951595
 ] 

Apache Spark commented on SPARK-20161:
--

User 'sahilTakiar' has created a pull request for this issue:
https://github.com/apache/spark/pull/17499

> Default log4j properties file should print thread-id in ConversionPattern
> -
>
> Key: SPARK-20161
> URL: https://issues.apache.org/jira/browse/SPARK-20161
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, YARN
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>
> The default log4j file in {{spark/conf/log4j.properties.template}} doesn't 
> display the thread-id when printing out the logs. It would be very useful to 
> add this, especially for YARN. Currently, logs from all the different threads 
> in a single executor are sent to the same log file. This makes debugging 
> difficult as it is hard to filter out what logs come from what thread.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20161:


Assignee: Apache Spark

> Default log4j properties file should print thread-id in ConversionPattern
> -
>
> Key: SPARK-20161
> URL: https://issues.apache.org/jira/browse/SPARK-20161
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, YARN
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>Assignee: Apache Spark
>
> The default log4j file in {{spark/conf/log4j.properties.template}} doesn't 
> display the thread-id when printing out the logs. It would be very useful to 
> add this, especially for YARN. Currently, logs from all the different threads 
> in a single executor are sent to the same log file. This makes debugging 
> difficult as it is hard to filter out what logs come from what thread.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern

2017-03-31 Thread Sahil Takiar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated SPARK-20161:
-
Summary: Default log4j properties file should print thread-id in 
ConversionPattern  (was: Default spark/conf/log4j.properties.template should 
print thread-id in ConversionPattern)

> Default log4j properties file should print thread-id in ConversionPattern
> -
>
> Key: SPARK-20161
> URL: https://issues.apache.org/jira/browse/SPARK-20161
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, YARN
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>
> The default log4j file in {{spark/conf/log4j.properties.template}} doesn't 
> display the thread-id when printing out the logs. It would be very useful to 
> add this, especially for YARN. Currently, logs from all the different threads 
> in a single executor are sent to the same log file. This makes debugging 
> difficult as it is hard to filter out what logs come from what thread.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20179.
---
Resolution: Duplicate

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement for 
> single-item patterns through the use of a CP based solver (Namely OscaR, an 
> open-source solver). And slight perfomance improved for multi-item patterns. 
> As you can see in the log scale graph reachable from the link below, the 
> performance are at worse roughly the same but at best, up to 50x faster (FIFA 
> dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : No improvement in 
> perfomance here, simply allows the use of unlimited max pattern length.
>   - Min pattern length : Any item below that length won't be outputted. No 
> improvement in performance, just a new functionnality.
>   - Max Item per itemset : An itemset won't be grown further than the inputed 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Of course, all of theses added feature where tested for correctness. As you 
> can see on the github link.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Trade offs :
>   - The algorithm does have a slight trade off. Since the algorithm needs to 
> detect whether sequences can use the specialised single-item patterns CP 
> algorithm. It may be a bit slower when it is not needed. This trade of was 
> mitigated by introducing a round sequence cleaning before the local 
> execution. Thus improving the performance of multi-item local executions when 
> the cleaning is effective. In case the no item can be cleaned, the check will 
> appear in the performence, creating a slight drop in them.
>   - All other change provided shouldn't have any effect on efficiency or 
> complexity if left to their default value. (Where they are basically 
> desactivated). When activated, they may however reduce the search space and 
> thus improve performance.
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, given the opportunity.
>  - The performance displayed here are the results of six months of work. 
> Various other things were tested to improve performance, without as much 
> success. I can thus say with a bit of confidence that the performances here 
> attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to 
> provide additionnal 

[jira] [Updated] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices

2017-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20109:
--
Priority: Minor  (was: Major)

> Need a way to convert from IndexedRowMatrix to Dense Block Matrices
> ---
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>Priority: Minor
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. Users need an option to convert from IndexedRowMatrix to 
> BlockMatrix that backs the BlockMatrix with local instances of DenseMatrix. 
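As a rough sketch of the desired end state (not the in-library conversion the issue asks for), the blocks of an already-built BlockMatrix can be densified on the user side; the helper name toDenseBlocks and the variable names are invented here, and this does nothing to avoid the intermediate CoordinateMatrix cost described above:

import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Rebuild each local block as a DenseMatrix (Matrix.toArray is column-major,
// which is what the DenseMatrix constructor expects).
def toDenseBlocks(bm: BlockMatrix): BlockMatrix = {
  val denseBlocks = bm.blocks.mapValues { m =>
    new DenseMatrix(m.numRows, m.numCols, m.toArray): Matrix
  }
  new BlockMatrix(denseBlocks, bm.rowsPerBlock, bm.colsPerBlock, bm.numRows(), bm.numCols())
}

// Usage (assuming an existing IndexedRowMatrix `irm`):
//   val denseBacked = toDenseBlocks(irm.toBlockMatrix())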



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20180) Unlimited max pattern length in Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20180:
---
Description: 
Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output.

The current default maxPatternLength value is 10.

This should be changed so that an input of 0 outputs patterns of any length. Additionally, the default value should be changed to 0, so that a new user can find all patterns in a dataset without having to look at this parameter.

  was:
Right now, we need to use .setMaxPatternLength() method to
specify is the maximum pattern length of a sequence. Any pattern longer than 
that won't be outputted.

The current default maxPatternlength value being 10.

This should be changed so that with input 0, all pattern of any length would be 
outputted. Additionally, the default value should be changed to 0, so that a 
new user could find all the pattern in his dataset without looking at this 
parameter.


> Unlimited max pattern length in Prefix span
> ---
>
> Key: SPARK-20180
> URL: https://issues.apache.org/jira/browse/SPARK-20180
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use .setMaxPatternLength() method to
> specify is the maximum pattern length of a sequence. Any pattern longer than 
> that won't be outputted.
> The current default maxPatternlength value being 10.
> This should be changed so that with input 0, all pattern of any length would 
> be outputted. Additionally, the default value should be changed to 0, so that 
> a new user could find all patterns in his dataset without looking at this 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20180) Unlimited max pattern length in Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20180:
---
Description: 
Right now, we need to use .setMaxPatternLength() method to
specify is the maximum pattern length of a sequence. Any pattern longer than 
that won't be outputted.

The current default maxPatternlength value being 10.

This should be changed so that with input 0, all pattern of any length would be 
outputted. Additionally, the default value should be changed to 0, so that a 
new user could find all the pattern in his dataset without looking at this 
parameter.

  was:
Right now, we need to use .setMaxPatternLength(x) (with x > 0) to
specify is the maximum pattern length of a sequence. Any pattern longer than 
that won't be outputted.

The current default maxPatternlength value being 10.

This should be changed so that with input 0, all pattern of any length would be 
outputted. Additionally, the default value should be changed to 0, so that a 
new user could find all the pattern in his dataset without looking at this 
parameter.


> Unlimited max pattern length in Prefix span
> ---
>
> Key: SPARK-20180
> URL: https://issues.apache.org/jira/browse/SPARK-20180
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use .setMaxPatternLength() method to
> specify is the maximum pattern length of a sequence. Any pattern longer than 
> that won't be outputted.
> The current default maxPatternlength value being 10.
> This should be changed so that with input 0, all pattern of any length would 
> be outputted. Additionally, the default value should be changed to 0, so that 
> a new user could find all the pattern in his dataset without looking at this 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20180) Unlimited max pattern length in Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)
Cyril de Vogelaere created SPARK-20180:
--

 Summary: Unlimited max pattern length in Prefix span
 Key: SPARK-20180
 URL: https://issues.apache.org/jira/browse/SPARK-20180
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Cyril de Vogelaere
Priority: Minor


Right now, we need to use .setMaxPatternLength(x) (with x > 0) to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output.

The current default maxPatternLength value is 10.

This should be changed so that an input of 0 outputs patterns of any length. Additionally, the default value should be changed to 0, so that a new user can find all the patterns in a dataset without having to look at this parameter.
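For context, a minimal sketch of the current MLlib API, adapted from the standard PrefixSpan example and assuming an existing SparkContext `sc`; today setMaxPatternLength takes a positive cap (default 10), and the proposal is for 0 to mean unlimited and to become the default:

import org.apache.spark.mllib.fpm.PrefixSpan

// Each sequence is an array of itemsets.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)   // today: hard cap, default 10; proposal: 0 = unlimited, as the default

val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString("<", "", ">") + ", " + fs.freq)
}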



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue

2017-03-31 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951473#comment-15951473
 ] 

Kazuaki Ishizaki commented on SPARK-20176:
--

Could you please post the program that can reproduce this issue?

> Spark Dataframe UDAF issue
> --
>
> Key: SPARK-20176
> URL: https://issues.apache.org/jira/browse/SPARK-20176
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Dinesh Man Amatya
>
> Getting following error in custom UDAF
> Error while decoding: java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private Object[] values1;
> /* 011 */   private org.apache.spark.sql.types.StructType schema;
> /* 012 */   private org.apache.spark.sql.types.StructType schema1;
> /* 013 */
> /* 014 */
> /* 015 */   public SpecificSafeProjection(Object[] references) {
> /* 016 */ this.references = references;
> /* 017 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 018 */
> /* 019 */
> /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) 
> references[1];
> /* 022 */   }
> /* 023 */
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ values = new Object[2];
> /* 028 */
> /* 029 */ boolean isNull2 = i.isNullAt(0);
> /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
> /* 031 */
> /* 032 */ boolean isNull1 = isNull2;
> /* 033 */ final java.lang.String value1 = isNull1 ? null : 
> (java.lang.String) value2.toString();
> /* 034 */ isNull1 = value1 == null;
> /* 035 */ if (isNull1) {
> /* 036 */   values[0] = null;
> /* 037 */ } else {
> /* 038 */   values[0] = value1;
> /* 039 */ }
> /* 040 */
> /* 041 */ boolean isNull5 = i.isNullAt(1);
> /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2));
> /* 043 */ boolean isNull3 = false;
> /* 044 */ org.apache.spark.sql.Row value3 = null;
> /* 045 */ if (!false && isNull5) {
> /* 046 */
> /* 047 */   final org.apache.spark.sql.Row value6 = null;
> /* 048 */   isNull3 = true;
> /* 049 */   value3 = value6;
> /* 050 */ } else {
> /* 051 */
> /* 052 */   values1 = new Object[2];
> /* 053 */
> /* 054 */   boolean isNull10 = i.isNullAt(1);
> /* 055 */   InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2));
> /* 056 */
> /* 057 */   boolean isNull9 = isNull10 || false;
> /* 058 */   final boolean value9 = isNull9 ? false : (Boolean) 
> value10.isNullAt(0);
> /* 059 */   boolean isNull8 = false;
> /* 060 */   double value8 = -1.0;
> /* 061 */   if (!isNull9 && value9) {
> /* 062 */
> /* 063 */ final double value12 = -1.0;
> /* 064 */ isNull8 = true;
> /* 065 */ value8 = value12;
> /* 066 */   } else {
> /* 067 */
> /* 068 */ boolean isNull14 = i.isNullAt(1);
> /* 069 */ InternalRow value14 = isNull14 ? null : (i.getStruct(1, 2));
> /* 070 */ boolean isNull13 = isNull14;
> /* 071 */ double value13 = -1.0;
> /* 072 */
> /* 073 */ if (!isNull14) {
> /* 074 */
> /* 075 */   if (value14.isNullAt(0)) {
> /* 076 */ isNull13 = true;
> /* 077 */   } else {
> /* 078 */ value13 = value14.getDouble(0);
> /* 079 */   }
> /* 080 */
> /* 081 */ }
> /* 082 */ isNull8 = isNull13;
> /* 083 */ value8 = value13;
> /* 084 */   }
> /* 085 */   if (isNull8) {
> /* 086 */ values1[0] = null;
> /* 087 */   } else {
> /* 088 */ values1[0] = value8;
> /* 089 */   }
> /* 090 */
> /* 091 */   boolean isNull17 = i.isNullAt(1);
> /* 092 */   InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2));
> /* 093 */
> /* 094 */   boolean isNull16 = isNull17 || false;
> /* 095 */   final boolean value16 = isNull16 ? false : (Boolean) 
> value17.isNullAt(1);
> /* 096 */   boolean isNull15 = false;
> /* 097 */   double value15 = -1.0;
> /* 098 */   if (!isNull16 && value16) {
> /* 099 */
> /* 100 */ 

[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951438#comment-15951438
 ] 

Cyril de Vogelaere commented on SPARK-20179:


Hello Joseph,

Thanks for your very helpful comment. I will start by treating each additional functionality separately, to get familiar with the process. I will also explain in depth what each one could bring to the user, and why I judge it important, before finishing with the main part of the code and the CP implementation.

Is it OK if I keep adding you as shepherd?

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement for 
> single-item patterns through the use of a CP based solver (Namely OscaR, an 
> open-source solver). And slight perfomance improved for multi-item patterns. 
> As you can see in the log scale graph reachable from the link below, the 
> performance are at worse roughly the same but at best, up to 50x faster (FIFA 
> dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : No improvement in 
> perfomance here, simply allows the use of unlimited max pattern length.
>   - Min pattern length : Any item below that length won't be outputted. No 
> improvement in performance, just a new functionnality.
>   - Max Item per itemset : An itemset won't be grown further than the inputed 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Of course, all of theses added feature where tested for correctness. As you 
> can see on the github link.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Trade offs :
>   - The algorithm does have a slight trade off. Since the algorithm needs to 
> detect whether sequences can use the specialised single-item patterns CP 
> algorithm. It may be a bit slower when it is not needed. This trade of was 
> mitigated by introducing a round sequence cleaning before the local 
> execution. Thus improving the performance of multi-item local executions when 
> the cleaning is effective. In case the no item can be cleaned, the check will 
> appear in the performence, creating a slight drop in them.
>   - All other change provided shouldn't have any effect on efficiency or 
> complexity if left to their default value. (Where they are basically 
> desactivated). When activated, they may however reduce the search space and 
> thus improve performance.
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, 

[jira] [Resolved] (SPARK-20165) Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec

2017-03-31 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20165.
---
Resolution: Fixed

Issue resolved by pull request 17488
[https://github.com/apache/spark/pull/17488]

> Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec
> 
>
> Key: SPARK-20165
> URL: https://issues.apache.org/jira/browse/SPARK-20165
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Encoder's deserializer must be resolved at the driver where the class is 
> defined. Otherwise there are corner cases using nested classes where 
> resolving at the executor can fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20160) Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog

2017-03-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20160.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17484
[https://github.com/apache/spark/pull/17484]

> Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog
> 
>
> Key: SPARK-20160
> URL: https://issues.apache.org/jira/browse/SPARK-20160
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> {{ParquetConversions}} and {{OrcConversions}} should be treated as regular 
> Analyzer rules. It is not reasonable to be part of {{HiveSessionCatalog}}. 
> After moving these two rules out of {{HiveSessionCatalog}}, the next step is 
> to rename {{HiveMetastoreCatalog}} because it is not related to the hive 
> package any more.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951269#comment-15951269
 ] 

Joseph K. Bradley commented on SPARK-20179:
---

Thanks for the thoughts & work.  Sean's right that these practices are 
described in the contributing guide, as well as a lot of other helpful info.  
I'd recommend a few things:
* Split the proposals up into smaller pieces.  Putting everything into 1 JIRA 
and/or PR makes it hard for reviewers to understand what is being proposed and 
how the changes interact.
* Make JIRA titles and descriptions very clear in terms of what the key change 
is.  If it's multiple changes, can these be broken into separate parts and 
added incrementally?  If the changes are related, it can be OK to create an 
umbrella JIRA which gives a holistic view; you can put the actual changes and 
PRs under subtasks.
* Start with the smallest incremental changes you're interested in to get 
familiar with the contribution process.
* Keep the perspective of reviewers in mind: If the code is long or complex to 
describe, it's going to be overwhelming to reviewers who have never seen it.

Thanks!

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement for 
> single-item patterns through the use of a CP based solver (Namely OscaR, an 
> open-source solver). And slight perfomance improved for multi-item patterns. 
> As you can see in the log scale graph reachable from the link below, the 
> performance are at worse roughly the same but at best, up to 50x faster (FIFA 
> dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : No improvement in 
> perfomance here, simply allows the use of unlimited max pattern length.
>   - Min pattern length : Any item below that length won't be outputted. No 
> improvement in performance, just a new functionnality.
>   - Max Item per itemset : An itemset won't be grown further than the inputed 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Of course, all of theses added feature where tested for correctness. As you 
> can see on the github link.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Trade offs :
>   - The algorithm does have a slight trade off. Since the algorithm needs to 
> detect whether sequences can use the specialised single-item patterns CP 
> algorithm. It may be a bit slower when it is not needed. This trade of was 
> mitigated by introducing a round sequence cleaning before the local 
> execution. Thus improving the performance of multi-item local executions when 
> the cleaning is effective. In case the no item can be cleaned, the check will 
> appear in the performence, creating a slight drop in them.
>   - All other change provided shouldn't have any effect on efficiency 

[jira] [Resolved] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files

2017-03-31 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20084.

   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.1.2
   2.2.0

> Remove internal.metrics.updatedBlockStatuses accumulator from history files
> ---
>
> Key: SPARK-20084
> URL: https://issues.apache.org/jira/browse/SPARK-20084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.2.0, 2.1.2
>
>
> History files for large jobs can be hundreds of GB. These history files take 
> too much space and create a backlog on the history server.
> Most of the size is from Accumulables in SparkListenerTaskEnd. The largest 
> accumulable is internal.metrics.updatedBlockStatuses, which has a small 
> update (the blocks that were changed) but a huge value (all known blocks). 
> Nothing currently uses the accumulator value or update, so it is safe to 
> remove it. Information for any block updated during a task is also recorded 
> under Task Metrics / Updated Blocks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Description: 
The code I would like to push allows a major performance improvement for single-item patterns through the use of a CP-based solver (namely OscaR, an open-source solver), and a slight performance improvement for multi-item patterns. As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same and at best up to 50x faster (FIFA dataset).

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9)

Performance was tested on the CECI servers, providing the driver with 10G of memory (more than needed) and running 4 slaves.

In addition to the performance improvements, I also added a number of new functionalities :

  - Unlimited max pattern length (with input 0) : no performance improvement here, this simply allows an unlimited max pattern length.

  - Min pattern length : any pattern below that length won't be output. No performance improvement, just a new functionality.

  - Max items per itemset : an itemset won't be grown further than the given number, thus reducing the search space.

  - Head start : during the initial dataset cleaning, the frequent items were found and then discarded, which resulted in an inefficient first iteration of the genFreqPattern method. The algorithm now uses them if they are provided, and uses the empty pattern in case they're not. A slight performance improvement was found.

  - Sub-problem limit : when the resulting item sequences can be very long and the user has a small number of very powerful machines, this parameter allows a quick switch to local execution, tremendously improving performance. Outside of those conditions, performance may be negatively affected.

  - Item constraints : allow the user to specify constraints on the occurrences of an item (=, >, <, >=, <=, !=). For users looking for specific results, the search space can be greatly reduced, which also improves performance.

Of course, all of these added features were tested for correctness, as you can see at the GitHub link.

Please note that the aforementioned functionalities didn't come into play when testing performance. The performance shown on the graph is 'merely' the result of replacing the local execution with a CP-based algorithm. (maxLocalProjDBSize was also kept at its default value (3200L), to make sure the performance couldn't be artificially improved.)

Trade-offs :

  - The algorithm does have a slight trade-off: since it needs to detect whether sequences can use the specialised single-item-pattern CP algorithm, it may be a bit slower when that algorithm is not needed. This trade-off was mitigated by introducing a round of sequence cleaning before the local execution, thus improving the performance of multi-item local executions when the cleaning is effective. In case no items can be cleaned, the cost of the check shows up as a slight drop in performance.

  - All the other changes provided shouldn't have any effect on efficiency or complexity if left at their default values (where they are basically deactivated). When activated, they may however reduce the search space and thus improve performance.

Additional notes :

 - The performance improvements are mostly seen for datasets where all itemsets are of size one, since that is a necessary condition for the use of the CP-based algorithm. But as you can see in the two slen datasets, performance was also slightly improved for datasets which have multiple items per itemset. The algorithm was built to detect when CP can be used, and to use it, given the opportunity.

 - The performance displayed here is the result of six months of work. Various other things were tested to improve performance, without as much success. I can thus say with a bit of confidence that the performance attained here will be very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to provide additional information, it would be my pleasure to do so :). Just hit me up at my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familiar with the whole procedure at all. In case I did something incorrectly, I will fix it as soon as possible.


  was:
The code I would like to push allows major performances improvement for 
single-item patterns through the use of a CP based solver (Namely OscaR, an 
open-source solver). And slight perfomance improved for multi-item patterns. As 
you can see in the log 

[jira] [Resolved] (SPARK-20164) AnalysisException not tolerant of null query plan

2017-03-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20164.
-
   Resolution: Fixed
 Assignee: Kunal Khamar
Fix Version/s: 2.2.0
   2.1.2

> AnalysisException not tolerant of null query plan
> -
>
> Key: SPARK-20164
> URL: https://issues.apache.org/jira/browse/SPARK-20164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.1.2, 2.2.0
>
>
> When someone throws an AnalysisException with a null query plan (which 
> ideally no one should), getMessage is not tolerant of this and throws a null 
> pointer exception, leading to loss of information about original exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951202#comment-15951202
 ] 

Li Jin commented on SPARK-20144:


Thanks Sean! I appreciate your time and help very much.

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting
> dataframe is not the same as the ordering of rows in the dataframe that the
> parquet file was produced with.
> This is because FileSourceStrategy.scala combines the parquet files into
> fewer partitions and also reorders them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec
> changed quite a bit from 2.0.2 to 2.1, so I'm not sure if this is an issue with
> 2.1.
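Not an official recommendation, but one defensive pattern (the row_order column name, the path, and the df/spark handles are invented here) is to persist an explicit ordering column rather than rely on file and partition order surviving the read:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Capture the current ordering explicitly before writing...
df.withColumn("row_order", monotonically_increasing_id())
  .write.parquet("/tmp/ordered_table")

// ...and restore it explicitly after reading, instead of assuming the scan
// returns rows in the original order.
val restored = spark.read.parquet("/tmp/ordered_table")
  .orderBy("row_order")
  .drop("row_order")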



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-03-31 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951194#comment-15951194
 ] 

Imran Rashid commented on SPARK-20178:
--

Thanks for writing this up Tom.

The only way I see to have a pluggable interface in the current code is to abstract out the *entire* thing -- DAGScheduler, TSM, TSI, and perhaps also CGSB and OCC. That would be pretty extreme, though; I'd only consider it if we actually have some reason to think we'd come up with a better version (e.g. new abstractions with less shared state).

In addition to not destabilizing the current scheduler, we should also think about what the migration path would be for enabling these new changes. Will there be a way for Spark to auto-tune? Or will we need to create a number of new confs? I know everyone hates having a huge set of configuration that needs to be tuned, but at some point I think it's OK if Spark works reasonably well on small clusters by default, and for large clusters you've just got to have somebody who knows how to configure it carefully.

Another thing to keep in mind is that Spark is used on a huge variety of workloads. I feel like right now we're very focused on large jobs on big clusters with long tasks, but Spark is also used with very small tasks, especially in streaming. I think all the ideas we're considering only affect behavior after there is a failure, so hopefully it wouldn't matter. But we need to be careful that we don't introduce complexity which affects performance even before any failures.

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 jira currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20156) Local dependent library used for upper and lowercase conversions.

2017-03-31 Thread Serkan Taş (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serkan Taş updated SPARK-20156:
---
Comment: was deleted

(was: console log before setting locale)

> Local dependent library used for upper and lowercase conversions.
> -
>
> Key: SPARK-20156
> URL: https://issues.apache.org/jira/browse/SPARK-20156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0
> Environment: Ubunutu 16.04
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
>Reporter: Serkan Taş
> Attachments: sprk_shell.txt
>
>
> If the regional setting of the operating system is Turkish, the famous Java
> locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or
> https://issues.apache.org/jira/browse/AVRO-1493).
> e.g. :
> "SERDEINFO" lowercases to "serdeınfo"
> "uniquetable" uppercases to "UNİQUETABLE"
> Workaround :
> add -Duser.country=US -Duser.language=en to the end of the line
> SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
> in spark-shell.sh
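For illustration, a small sketch of the underlying JVM behavior and of the locale-insensitive alternative (Locale.ROOT) that library code typically uses to avoid it:

import java.util.Locale

// Under a Turkish default locale, the 'i'/'I' case mappings change (dotted/dotless i).
Locale.setDefault(new Locale("tr", "TR"))
println("SERDEINFO".toLowerCase)                // serdeınfo   (dotless ı)
println("uniquetable".toUpperCase)              // UNİQUETABLE (dotted İ)

// Locale-insensitive conversions side-step the problem.
println("SERDEINFO".toLowerCase(Locale.ROOT))   // serdeinfo
println("uniquetable".toUpperCase(Locale.ROOT)) // UNIQUETABLE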



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20163) Kill all running tasks in a stage in case of fetch failure

2017-03-31 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951136#comment-15951136
 ] 

Imran Rashid commented on SPARK-20163:
--

I think this is a duplicate of SPARK-2666, which has more discussion in it.  
Unless there is something which makes this distinct, can we close this as a 
duplicate?

> Kill all running tasks in a stage in case of fetch failure
> --
>
> Key: SPARK-20163
> URL: https://issues.apache.org/jira/browse/SPARK-20163
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently, the scheduler does not kill the running tasks in a stage when it 
> encounters fetch failure, as a result, we might end up running many duplicate 
> tasks in the cluster. There is already a TODO in TaskSetManager to kill all 
> running tasks which has not been implemented.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Description: 
The code I would like to push allows major performances improvement for 
single-item patterns through the use of a CP based solver (Namely OscaR, an 
open-source solver). And slight perfomance improved for multi-item patterns. As 
you can see in the log scale graph reachable from the link below, the 
performance are at worse roughly the same but at best, up to 50x faster (FIFA 
dataset)

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two slen datasets used are the first two in the list of 9)

Performances were tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and running 4 slaves.

Additionnally to the performance improvements. I also added a bunch of new 
fonctionnalities :

  - Unlimited Max pattern length (with input 0) : No improvement in perfomance 
here, simply allows the use of unlimited max pattern length.

  - Min pattern length : Any item below that length won't be outputted. No 
improvement in performance, just a new functionnality.

  - Max Item per itemset : An itemset won't be grown further than the inputed 
number, thus reducing the search space. 

  - Head start : During the initial dataset cleaning, the frequent item were 
found then discarded. Which resulted in a inefficient first iteration of the 
genFreqPattern method. The algorithm new uses them if they are provided, and 
uses the empty pattern in case they're not. Slight improvement of performances 
were found.

  - Sub-problem limit : When resulting item sequence can be very long and the 
user disposes of a small number of very powerfull machine, this parameter will 
allow a quick switch to local execution. Tremendously improving performances. 
Outside of those conditions, the performance may be the negatively affected.

  - Item constraints : Allow the user to specify constraint on the occurences 
of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
results, the search space can be greatly reduced. Which also improve 
performances.

Of course, all of theses added feature where test for correctness. As you can 
see on the github link.

Please take note that the afformentionned fonctionnalities didn't come into 
play when testing the performance. The performance shown on the graph are 
'merely' the result of the remplacement of the local execution by a CP based 
algorithm. (maxLocalProjDBSize was also kept to it's default value (3200L), 
to make sure the performance couldn't be artificially improved)

Trade offs :

  - The algorithm does have a slight trade off. Since the algorithm needs to 
detect whether sequences can use the specialised single-item patterns CP 
algorithm. It may be a bit slower when it is not needed. This trade of was 
mitigated by introducing a round sequence cleaning before the local execution. 
Thus improving the performance of multi-item local executions when the cleaning 
is effective. In case the no item can be cleaned, the check will appear in the 
performence, creating a slight drop in them.

  - All other change provided shouldn't have any effect on efficiency or 
complexity if left to their default value. (Where they are basically 
desactivated). When activated, they may however reduce the search space and 
thus improve performance.

Additionnal note :

 - The performance improvement are mostly seen far datasets where all itemset 
are of size one, since it is a necessary condition for the use of the CP based 
algorithm. But as you can see in the two Slen datasets, performances were also 
slightly improved for algorithms which have multiple items per itemset. The 
algorithm was built to detect when CP can be used, and to use it, given the 
opportunity.

 - The performance displayed here are the results of six months of work. 
Various other things were tested to improve performance, without as much 
success. I can thus say with a bit of confidence that the performances here 
attained will be very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to 
provide additionnal informations, it would be my pleasure to do so :). Just it 
me up with my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familliar with the whole 
procedure at all. In case, I did something incorrectly, I will fix it as soon 
as possible.


  was:
The code I would like to push allows major performances improvement for 
single-item patterns through the use of a CP based solver (Namely OscaR, an 
open-source solver). And slight perfomance improved for multi-item patterns. As 
you can see in the log 

[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Description: 
The code I would like to push allows a major performance improvement for 
single-item patterns through the use of a CP based solver (namely OscaR, an 
open-source solver), and a slight performance improvement for multi-item 
patterns. As you can see in the log-scale graph reachable from the link below, 
performance is at worst roughly the same and at best up to 50x faster (FIFA 
dataset).

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two Slen datasets used are the first two in the list of 9)

Performance was tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and running 4 slaves.

In addition to the performance improvements, I also added a bunch of new 
functionalities :

  - Unlimited max pattern length (with input 0) : No improvement in performance 
here, simply allows the use of an unlimited max pattern length.

  - Min pattern length : Any pattern below that length won't be outputted. No 
improvement in performance, just a new functionality.

  - Max items per itemset : An itemset won't be grown further than the inputted 
number, thus reducing the search space.

  - Head start : During the initial dataset cleaning, the frequent items were 
found and then discarded, which resulted in an inefficient first iteration of 
the genFreqPattern method. The algorithm now uses them if they are provided, 
and uses the empty pattern in case they're not. A slight improvement in 
performance was found.

  - Sub-problem limit : When the resulting item sequences can be very long and 
the user disposes of a small number of very powerful machines, this parameter 
allows a quick switch to local execution, tremendously improving performance. 
Outside of those conditions, performance may be negatively affected.

  - Item constraints : Allows the user to specify constraints on the 
occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for 
some specific results, the search space can be greatly reduced, which also 
improves performance.


Please note that the aforementioned functionalities didn't come into play when 
testing the performance. The performance shown on the graph is 'merely' the 
result of the replacement of the local execution by a CP based algorithm. 
(maxLocalProjDBSize was also kept at its default value (3200L), to make sure 
the performance couldn't be artificially improved.)
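
For reference, here is a minimal sketch of how the existing MLlib PrefixSpan 
API is driven today, using only parameters that already exist (minSupport, 
maxPatternLength, maxLocalProjDBSize); the new options described above are 
extensions proposed in the linked branch and are not part of this API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.fpm.PrefixSpan

    object PrefixSpanBaseline {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("prefixspan-baseline").setMaster("local[*]"))

        // Each sequence is an Array of itemsets; a "single-item" dataset is one
        // where every inner array has exactly one element.
        val sequences = sc.parallelize(Seq(
          Array(Array(1), Array(2), Array(3)),
          Array(Array(1), Array(3), Array(2)),
          Array(Array(1), Array(2), Array(5)),
          Array(Array(6))
        ), 2).cache()

        val model = new PrefixSpan()
          .setMinSupport(0.5)               // keep patterns occurring in >= 50% of sequences
          .setMaxPatternLength(5)           // existing upper bound on pattern length
          .setMaxLocalProjDBSize(32000000L) // projected-database size threshold for local execution
          .run(sequences)

        model.freqSequences.collect().foreach { fs =>
          println(fs.sequence.map(_.mkString("(", ",", ")")).mkString("<", "", ">") + ", " + fs.freq)
        }

        sc.stop()
      }
    }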

Trade-offs :

  - The algorithm does have a slight trade-off. Since it needs to detect 
whether sequences can use the specialised single-item-pattern CP algorithm, it 
may be a bit slower when that algorithm is not needed. This trade-off was 
mitigated by introducing a round of sequence cleaning before the local 
execution, thus improving the performance of multi-item local executions when 
the cleaning is effective. In case no item can be cleaned, the check will show 
up in the measurements as a slight drop in performance.

  - All other changes provided shouldn't have any effect on efficiency or 
complexity if left to their default values (where they are basically 
deactivated). When activated, they may however reduce the search space and 
thus improve performance.
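
As an aside, one plausible way to detect whether the single-item specialisation 
applies (an assumption for illustration; not necessarily how the linked branch 
does it) is to check that every itemset in every sequence has size one:

    import org.apache.spark.rdd.RDD

    // Returns true when no sequence contains an itemset with more than one item,
    // i.e. the CP specialisation for single-item patterns is applicable.
    def allSingleItem(sequences: RDD[Array[Array[Int]]]): Boolean =
      sequences.filter(seq => seq.exists(_.length > 1)).isEmpty()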

Additional notes :

 - The performance improvements are mostly seen for datasets where all itemsets 
are of size one, since that is a necessary condition for the use of the CP based 
algorithm. But as you can see in the two Slen datasets, performance was also 
slightly improved for datasets which have multiple items per itemset. The 
algorithm was built to detect when CP can be used, and to use it given the 
opportunity.

 - The performance displayed here is the result of six months of work. Various 
other approaches were tested to improve performance, without as much success. I 
can thus say with some confidence that the performance attained here will be 
very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to 
provide additional information, it would be my pleasure to do so :). Just hit 
me up at my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familiar with the whole 
procedure at all. In case I did something incorrectly, I will fix it as soon 
as possible.


  was:
The code I would like to push allows major performances improvement for 
single-item patterns through the use of a CP based solver (Namely OscaR, an 
open-source solver). And slight perfomance improved for multi-item patterns. As 
you can see in the log scale graph reachable from the link below, the 
performance are at worse roughly the same but at best, up 

[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Priority: Major  (was: Minor)

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement for 
> single-item patterns through the use of a CP based solver (Namely OscaR, an 
> open-source solver). And slight perfomance improved for multi-item patterns. 
> As you can see in the log scale graph reachable from the link below, the 
> performance are at worse roughly the same but at best, up to 50x faster (FIFA 
> dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : 
>   - Min pattern length : Any item below that length won't be outputted
>   - Max Item per itemset : An itemset won't be grown further than the input 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Trade offs :
>   - The algorithm does have a slight trade off. Since the algorithm needs to 
> detect whether sequences can use the specialised single-item patterns CP 
> algorithm. It may be a bit slower when it is not needed. This trade of was 
> mitigated by introducing a round sequence cleaning before the local 
> execution. Thus improving the performance of multi-item local executions when 
> the cleaning is effective. In case the no item can be cleaned, the check will 
> appear in the performence, creating a slight drop in them.
>   - All other change provided shouldn't have any effect on efficiency or 
> complexity if left to their default value. (Where they are basically 
> desactivated). When activated, they may however reduce the search space and 
> thus improve performance.
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, given the opportunity.
>  - The performance displayed here are the results of six months of work. 
> Various other things were tested to improve performance, without as much 
> success. I can thus say with a bit of confidence that the performances here 
> attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to 
> provide additionnal informations, it would be my pleasure to do so :). Just 
> it me up with my email below :
> Email : cyril.devogela...@gmail.com
>  - I am a newbie contributor to Spark, and am not familliar with the whole 
> procedure at all. In case, I did 

[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951095#comment-15951095
 ] 

Sean Owen commented on SPARK-20144:
---

Probably best to wait for an informed opinion, but I would assume for now you 
need to sort.

I'm just saying that, theoretically, already-sorted data needs no data movement 
to become sorted. It may not actually even be expensive.
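
To make that concrete, here is a hedged sketch (paths and column names are made 
up for illustration) of one way to avoid depending on file/partition order at 
all: persist an explicit ordering column before writing, and sort on it after 
reading.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.monotonically_increasing_id

    val spark = SparkSession.builder()
      .appName("order-preserving-roundtrip")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // Materialize the current row order as data before writing it out.
    val withIdx = df.withColumn("row_idx", monotonically_increasing_id())
    withIdx.write.mode("overwrite").parquet("/tmp/ordered_example")

    // After reading, restore the order explicitly instead of relying on how
    // the planner packs the parquet files into partitions.
    val restored = spark.read.parquet("/tmp/ordered_example")
      .orderBy("row_idx")
      .drop("row_idx")
    restored.show()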

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Description: 
The code I would like to push allows a major performance improvement for 
single-item patterns through the use of a CP based solver (namely OscaR, an 
open-source solver), and a slight performance improvement for multi-item 
patterns. As you can see in the log-scale graph reachable from the link below, 
performance is at worst roughly the same and at best up to 50x faster (FIFA 
dataset).

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two Slen datasets used are the first two in the list of 9)

Performance was tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and running 4 slaves.

In addition to the performance improvements, I also added a bunch of new 
functionalities :

  - Unlimited max pattern length (with input 0) :

  - Min pattern length : Any pattern below that length won't be outputted.

  - Max items per itemset : An itemset won't be grown further than the inputted 
number, thus reducing the search space.

  - Head start : During the initial dataset cleaning, the frequent items were 
found and then discarded, which resulted in an inefficient first iteration of 
the genFreqPattern method. The algorithm now uses them if they are provided, 
and uses the empty pattern in case they're not. A slight improvement in 
performance was found.

  - Sub-problem limit : When the resulting item sequences can be very long and 
the user disposes of a small number of very powerful machines, this parameter 
allows a quick switch to local execution, tremendously improving performance. 
Outside of those conditions, performance may be negatively affected.

  - Item constraints : Allows the user to specify constraints on the 
occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for 
some specific results, the search space can be greatly reduced, which also 
improves performance.


Please note that the aforementioned functionalities didn't come into play when 
testing the performance. The performance shown on the graph is 'merely' the 
result of the replacement of the local execution by a CP based algorithm. 
(maxLocalProjDBSize was also kept at its default value (3200L), to make sure 
the performance couldn't be artificially improved.)

Trade-offs :

  - The algorithm does have a slight trade-off. Since it needs to detect 
whether sequences can use the specialised single-item-pattern CP algorithm, it 
may be a bit slower when that algorithm is not needed. This trade-off was 
mitigated by introducing a round of sequence cleaning before the local 
execution, thus improving the performance of multi-item local executions when 
the cleaning is effective. In case no item can be cleaned, the check will show 
up in the measurements as a slight drop in performance.

  - All other changes provided shouldn't have any effect on efficiency or 
complexity if left to their default values (where they are basically 
deactivated). When activated, they may however reduce the search space and 
thus improve performance.

Additional notes :

 - The performance improvements are mostly seen for datasets where all itemsets 
are of size one, since that is a necessary condition for the use of the CP based 
algorithm. But as you can see in the two Slen datasets, performance was also 
slightly improved for datasets which have multiple items per itemset. The 
algorithm was built to detect when CP can be used, and to use it given the 
opportunity.

 - The performance displayed here is the result of six months of work. Various 
other approaches were tested to improve performance, without as much success. I 
can thus say with some confidence that the performance attained here will be 
very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to 
provide additional information, it would be my pleasure to do so :). Just hit 
me up at my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familiar with the whole 
procedure at all. In case I did something incorrectly, I will fix it as soon 
as possible.


  was:
The code I would like to push allows major performances improvement through the 
use of a CP based solver (Namely OscaR, an open-source solver). As you can see 
in the log scale graph reachable from the link below, the performance are at 
worse roughly the same but at best, up to 50x faster (FIFA dataset)

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 

[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Description: 
The code I would like to push allows major performances improvement through the 
use of a CP based solver (Namely OscaR, an open-source solver). As you can see 
in the log scale graph reachable from the link below, the performance are at 
worse roughly the same but at best, up to 50x faster (FIFA dataset)

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two slen datasets used are the first two in the list of 9)

Performances were tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and running 4 slaves.

Additionnally to the performance improvements. I also added a bunch of new 
fonctionnalities :

  - Unlimited Max pattern length (with input 0) : 

  - Min pattern length : Any item below that length won't be outputted

  - Max Item per itemset : An itemset won't be grown further than the input 
number, thus reducing the search space. 

  - Head start : During the initial dataset cleaning, the frequent item were 
found then discarded. Which resulted in a inefficient first iteration of the 
genFreqPattern method. The algorithm new uses them if they are provided, and 
uses the empty pattern in case they're not. Slight improvement of performances 
were found.

  - Sub-problem limit : When resulting item sequence can be very long and the 
user disposes of a small number of very powerfull machine, this parameter will 
allow a quick switch to local execution. Tremendously improving performances. 
Outside of those conditions, the performance may be the negatively affected.

  - Item constraints : Allow the user to specify constraint on the occurences 
of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
results, the search space can be greatly reduced. Which also improve 
performances.


Please take note that the afformentionned fonctionnalities didn't come into 
play when testing the performance. The performance shown on the graph are 
'merely' the result of the remplacement of the local execution by a CP based 
algorithm. (maxLocalProjDBSize was also kept to it's default value (3200L), 
to make sure the performance couldn't be artificially improved)

Trade offs :

  - The algorithm does have a slight trade off. Since the algorithm needs to 
detect whether sequences can use the specialised single-item patterns CP 
algorithm. It may be a bit slower when it is not needed. This trade of was 
mitigated by introducing a round sequence cleaning before the local execution. 
Thus improving the performance of multi-item local executions when the cleaning 
is effective. In case the no item can be cleaned, the check will appear in the 
performence, creating a slight drop in them.

  - All other change provided shouldn't have any effect on efficiency or 
complexity if left to their default value. (Where they are basically 
desactivated). When activated, they may however reduce the search space and 
thus improve performance.

Additionnal note :

 - The performance improvement are mostly seen far datasets where all itemset 
are of size one, since it is a necessary condition for the use of the CP based 
algorithm. But as you can see in the two Slen datasets, performances were also 
slightly improved for algorithms which have multiple items per itemset. The 
algorithm was built to detect when CP can be used, and to use it, given the 
opportunity.

 - The performance displayed here are the results of six months of work. 
Various other things were tested to improve performance, without as much 
success. I can thus say with a bit of confidence that the performances here 
attained will be very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to 
provide additionnal informations, it would be my pleasure to do so :). Just it 
me up with my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familliar with the whole 
procedure at all. In case, I did something incorrectly, I will fix it as soon 
as possible.


  was:
The code I would like to push allows major performances improvement through the 
use of a CP based solver (Namely OscaR, an open-source solver). As you can see 
in the log scale graph reachable from the link below, the performance are at 
worse roughly the same but at best, up to 50x faster (FIFA dataset)

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two slen datasets used are the first two in the list of 9)

Performances were 

[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---
Description: 
The code I would like to push allows a major performance improvement through 
the use of a CP based solver (namely OscaR, an open-source solver). As you can 
see in the log-scale graph reachable from the link below, performance is at 
worst roughly the same and at best up to 50x faster (FIFA dataset).

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two Slen datasets used are the first two in the list of 9)

Performance was tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and running 4 slaves.

Trade-offs :

  - The algorithm does have a slight trade-off. Since it needs to detect 
whether sequences can use the specialised single-item-pattern CP algorithm, it 
may be a bit slower when that algorithm is not needed. This trade-off was 
mitigated by introducing a round of sequence cleaning before the local 
execution, thus improving the performance of multi-item local executions when 
the cleaning is effective. In case no item can be cleaned, the check will show 
up in the measurements as a slight drop in performance.

  - All other changes provided shouldn't have any effect on efficiency or 
complexity if left to their default values (where they are basically 
deactivated).

In addition to the performance improvements, I also added a bunch of new 
functionalities :

  - Unlimited max pattern length (with input 0) :

  - Min pattern length : Any pattern below that length won't be outputted.

  - Max items per itemset : An itemset won't be grown further than the inputted 
number, thus reducing the search space.

  - Head start : During the initial dataset cleaning, the frequent items were 
found and then discarded, which resulted in an inefficient first iteration of 
the genFreqPattern method. The algorithm now uses them if they are provided, 
and uses the empty pattern in case they're not. A slight improvement in 
performance was found.

  - Sub-problem limit : When the resulting item sequences can be very long and 
the user disposes of a small number of very powerful machines, this parameter 
allows a quick switch to local execution, tremendously improving performance. 
Outside of those conditions, performance may be negatively affected.

  - Item constraints : Allows the user to specify constraints on the 
occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for 
some specific results, the search space can be greatly reduced, which also 
improves performance.


Please note that the aforementioned functionalities didn't come into play when 
testing the performance. The performance shown on the graph is 'merely' the 
result of the replacement of the local execution by a CP based algorithm. 
(maxLocalProjDBSize was also kept at its default value (3200L), to make sure 
the performance couldn't be artificially improved.)

Additional notes :

 - The performance improvements are mostly seen for datasets where all itemsets 
are of size one, since that is a necessary condition for the use of the CP based 
algorithm. But as you can see in the two Slen datasets, performance was also 
slightly improved for datasets which have multiple items per itemset. The 
algorithm was built to detect when CP can be used, and to use it given the 
opportunity.

 - The performance displayed here is the result of six months of work. Various 
other approaches were tested to improve performance, without as much success. I 
can thus say with some confidence that the performance attained here will be 
very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to 
provide additional information, it would be my pleasure to do so :). Just hit 
me up at my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familiar with the whole 
procedure at all. In case I did something incorrectly, I will fix it as soon 
as possible.


  was:
The code I would like to push allows major performances improvement through the 
use of a CP based solver (Namely OscaR, an open-source solver). As you can see 
in the log scale graph reachable from the link below, the performance are at 
worse roughly the same but at best, up to 50x faster (FIFA dataset)

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two slen datasets used are the first two in the list of 9)

Performances were tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and 

[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951084#comment-15951084
 ] 

Li Jin commented on SPARK-20144:


Also, I am not sure about "If the data were sorted, sorting would be pretty 
cheap". Can you explain more about this?

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951077#comment-15951077
 ] 

Cyril de Vogelaere commented on SPARK-20179:


Hello Sean,

I did have a look at the contributing page, but I don't really understand what 
you mean exactly. Do you mean I should go over every bit of code I changed? 
Because that may take a while ^^'

As for the performance trade-offs, I will add them to the ticket, starting 
right now.

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement through 
> the use of a CP based solver (Namely OscaR, an open-source solver). As you 
> can see in the log scale graph reachable from the link below, the performance 
> are at worse roughly the same but at best, up to 50x faster (FIFA dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : 
>   - Min pattern length : Any item below that length won't be outputted
>   - Max Item per itemset : An itemset won't be grown further than the input 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, given the opportunity.
>  - The performance displayed here are the results of six months of work. 
> Various other things were tested to improve performance, without as much 
> success. I can thus say with a bit of confidence that the performances here 
> attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to 
> provide additionnal informations, it would be my pleasure to do so :). Just 
> it me up with my email below :
> Email : cyril.devogela...@gmail.com
>  - I am a newbie contributor to Spark, and am not familliar with the whole 
> precedure at all. In case, I did something incorrectly, I will fix it as soon 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951073#comment-15951073
 ] 

Li Jin commented on SPARK-20144:


I totally agree that correctness takes precedence. If sorting is the only way, 
we will do that, but I think there is a way we can maintain ordering in the 
parquet format.

Parquet itself doesn't change the ordering: data in parquet is stored as 
parquet_file_0, parquet_file_1, ... and data is ordered within those files. 
However, it is FileSourceStrategy 
(https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L168)
 that re-sorts the parquet files and ends up changing the ordering.

If the expected semantics of Parquet don't maintain order, I won't complain 
about the behavior of spark.read.parquet, but it seems it's Catalyst that is 
changing the ordering here.

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20179:
--
Shepherd:   (was: Joseph K. Bradley)
   Flags:   (was: Important)
Target Version/s:   (was: 2.1.0)
  Labels:   (was: newbie performance test)
Priority: Minor  (was: Major)

Please start by reading the link I posted then. This is not how changes are 
proposed.

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement through 
> the use of a CP based solver (Namely OscaR, an open-source solver). As you 
> can see in the log scale graph reachable from the link below, the performance 
> are at worse roughly the same but at best, up to 50x faster (FIFA dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : 
>   - Min pattern length : Any item below that length won't be outputted
>   - Max Item per itemset : An itemset won't be grown further than the input 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, given the opportunity.
>  - The performance displayed here are the results of six months of work. 
> Various other things were tested to improve performance, without as much 
> success. I can thus say with a bit of confidence that the performances here 
> attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to 
> provide additionnal informations, it would be my pleasure to do so :). Just 
> it me up with my email below :
> Email : cyril.devogela...@gmail.com
>  - I am a newbie contributor to Spark, and am not familliar with the whole 
> precedure at all. In case, I did something incorrectly, I will fix it as soon 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951051#comment-15951051
 ] 

Cyril de Vogelaere commented on SPARK-20179:


I forgot to mention, I am ready to push the code anytime now.
But I heard that it needs to be reviewed and corrected first. I am not very 
familiar with the procedure, so it would be helpful if someone could advise me 
on what to do.

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>  Labels: newbie, performance, test
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement through 
> the use of a CP based solver (Namely OscaR, an open-source solver). As you 
> can see in the log scale graph reachable from the link below, the performance 
> are at worse roughly the same but at best, up to 50x faster (FIFA dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : 
>   - Min pattern length : Any item below that length won't be outputted
>   - Max Item per itemset : An itemset won't be grown further than the input 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, given the opportunity.
>  - The performance displayed here are the results of six months of work. 
> Various other things were tested to improve performance, without as much 
> success. I can thus say with a bit of confidence that the performances here 
> attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to 
> provide additionnal informations, it would be my pleasure to do so :). Just 
> it me up with my email below :
> Email : cyril.devogela...@gmail.com
>  - I am a newbie contributor to Spark, and am not familliar with the whole 
> precedure at all. In case, I did something incorrectly, I will fix it as soon 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951044#comment-15951044
 ] 

Sean Owen commented on SPARK-20179:
---

It's not clear what you are proposing _for Spark_. You're describing some 
modifications you made in your own build, but not what changed, or what the 
complexity and tradeoffs are. Have a look at 
http://spark.apache.org/contributing.html first, please.

I think this is indeed a duplicate of SPARK-10678.

> Major improvements to Spark's Prefix span
> -
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Cyril de Vogelaere
>  Labels: newbie, performance, test
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement through 
> the use of a CP based solver (Namely OscaR, an open-source solver). As you 
> can see in the log scale graph reachable from the link below, the performance 
> are at worse roughly the same but at best, up to 50x faster (FIFA dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets : 
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G 
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new 
> fonctionnalities :
>   - Unlimited Max pattern length (with input 0) : 
>   - Min pattern length : Any item below that length won't be outputted
>   - Max Item per itemset : An itemset won't be grown further than the input 
> number, thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were 
> found then discarded. Which resulted in a inefficient first iteration of the 
> genFreqPattern method. The algorithm new uses them if they are provided, and 
> uses the empty pattern in case they're not. Slight improvement of 
> performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the 
> user disposes of a small number of very powerfull machine, this parameter 
> will allow a quick switch to local execution. Tremendously improving 
> performances. Outside of those conditions, the performance may be the 
> negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences 
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific 
> results, the search space can be greatly reduced. Which also improve 
> performances.
> Please take note that the afformentionned fonctionnalities didn't come into 
> play when testing the performance. The performance shown on the graph are 
> 'merely' the result of the remplacement of the local execution by a CP based 
> algorithm. (maxLocalProjDBSize was also kept to it's default value 
> (3200L), to make sure the performance couldn't be artificially improved)
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset 
> are of size one, since it is a necessary condition for the use of the CP 
> based algorithm. But as you can see in the two Slen datasets, performances 
> were also slightly improved for algorithms which have multiple items per 
> itemset. The algorithm was built to detect when CP can be used, and to use 
> it, given the opportunity.
>  - The performance displayed here are the results of six months of work. 
> Various other things were tested to improve performance, without as much 
> success. I can thus say with a bit of confidence that the performances here 
> attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to 
> provide additionnal informations, it would be my pleasure to do so :). Just 
> it me up with my email below :
> Email : cyril.devogela...@gmail.com
>  - I am a newbie contributor to Spark, and am not familliar with the whole 
> precedure at all. In case, I did something incorrectly, I will fix it as soon 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10678) Specialize PrefixSpan for single-item patterns

2017-03-31 Thread Cyril de Vogelaere (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951040#comment-15951040
 ] 

Cyril de Vogelaere commented on SPARK-10678:


I hadn't seen this issue when I created mine.
I have finished an implementation which specialises PrefixSpan for single-item 
patterns, using a CP solver.

Here is the link to the issue I created, which proposes other improvements 
alongside this particular one:
https://issues.apache.org/jira/browse/SPARK-20179

Any advice you have for me would be welcome, since I'm a newbie at contributing 
to Spark.

> Specialize PrefixSpan for single-item patterns
> --
>
> Key: SPARK-10678
> URL: https://issues.apache.org/jira/browse/SPARK-10678
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> We assume the input itemsets are multi-item in PrefixSpan, e.g., (ab)(cd). In 
> some use cases, all itemsets are single-item, e.g., abcd. In this case, our 
> implementation has overhead remembering the boundaries between itemsets. We 
> could detect it and put specialized implementation for this use case.
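
To make the single-item vs. multi-item distinction concrete, a small 
illustrative sketch of the input encoding (variable names are only for 
illustration):

    // In MLlib's PrefixSpan input, each sequence is an Array of itemsets
    // (Array[Array[Int]]). Multi-item itemsets, e.g. the sequence (ab)(cd):
    val multiItem: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4))

    // Single-item itemsets, e.g. the sequence abcd: every inner array has size
    // one, so the itemset boundaries carry no information and a specialised
    // code path could skip tracking them.
    val singleItem: Array[Array[Int]] = Array(Array(1), Array(2), Array(3), Array(4))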



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20179) Major improvements to Spark's Prefix span

2017-03-31 Thread Cyril de Vogelaere (JIRA)
Cyril de Vogelaere created SPARK-20179:
--

 Summary: Major improvements to Spark's Prefix span
 Key: SPARK-20179
 URL: https://issues.apache.org/jira/browse/SPARK-20179
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
 Environment: All
Reporter: Cyril de Vogelaere


The code I would like to push allows a major performance improvement through 
the use of a CP based solver (namely OscaR, an open-source solver). As you can 
see in the log-scale graph reachable from the link below, performance is at 
worst roughly the same and at best up to 50x faster (FIFA dataset).

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two Slen datasets used are the first two in the list of 9)

Performance was tested on the CECI servers, providing the driver with 10G 
memory (more than needed) and running 4 slaves.

In addition to the performance improvements, I also added a bunch of new 
functionalities :

  - Unlimited max pattern length (with input 0) :

  - Min pattern length : Any pattern below that length won't be outputted.

  - Max items per itemset : An itemset won't be grown further than the inputted 
number, thus reducing the search space.

  - Head start : During the initial dataset cleaning, the frequent items were 
found and then discarded, which resulted in an inefficient first iteration of 
the genFreqPattern method. The algorithm now uses them if they are provided, 
and uses the empty pattern in case they're not. A slight improvement in 
performance was found.

  - Sub-problem limit : When the resulting item sequences can be very long and 
the user disposes of a small number of very powerful machines, this parameter 
allows a quick switch to local execution, tremendously improving performance. 
Outside of those conditions, performance may be negatively affected.

  - Item constraints : Allows the user to specify constraints on the 
occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for 
some specific results, the search space can be greatly reduced, which also 
improves performance.


Please note that the aforementioned functionalities didn't come into play when 
testing the performance. The performance shown on the graph is 'merely' the 
result of the replacement of the local execution by a CP based algorithm. 
(maxLocalProjDBSize was also kept at its default value (3200L), to make sure 
the performance couldn't be artificially improved.)

Additional notes :

 - The performance improvements are mostly seen for datasets where all itemsets 
are of size one, since that is a necessary condition for the use of the CP based 
algorithm. But as you can see in the two Slen datasets, performance was also 
slightly improved for datasets which have multiple items per itemset. The 
algorithm was built to detect when CP can be used, and to use it given the 
opportunity.

 - The performance displayed here is the result of six months of work. Various 
other approaches were tested to improve performance, without as much success. I 
can thus say with some confidence that the performance attained here will be 
very hard to improve further.

 - In case you want me to test the performance on a specific dataset or to 
provide additional information, it would be my pleasure to do so :). Just hit 
me up at my email below :

Email : cyril.devogela...@gmail.com

 - I am a newbie contributor to Spark, and am not familiar with the whole 
procedure at all. In case I did something incorrectly, I will fix it as soon 
as possible.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950988#comment-15950988
 ] 

Sean Owen commented on SPARK-20144:
---

If the data were sorted, sorting would be pretty cheap, in general. Correctness 
has to take precedence in any event, if you're describing this as a blocker for 
you.
I don't believe projection can change ordering, no. I am saying that I would 
not necessarily expect that to extend to external serialization. I don't see 
that being tabular or on HDFS matters. I think some serializations would 
naturally preserve order and others would not. I am still not 100% sure what 
the expected semantics of Parquet are here, but you have de facto evidence it 
is not guaranteed.

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950979#comment-15950979
 ] 

Li Jin edited comment on SPARK-20144 at 3/31/17 2:14 PM:
-

Thanks for getting back to me.

Sorting in this case will just add extra cost to our workflow, and we are 
trying to avoid it in the first place.

Because DataFrame presents the data in a tabular format, it is very surprising 
that the ordering of rows in the table changes after going through HDFS. In any 
other tabular format that I know of, the ordering of rows is a property of the 
data, and it is surprising that reading/writing changes properties of the data. 
This is also a bit scary because, if ordering were not a property of a 
DataFrame, could things like cache or select("col") change the ordering of rows 
in the future?



was (Author: icexelloss):
Thanks for getting back to me.

Sorting in this case will just add extra cost to in our workflow and we are 
trying to avoid it in the first place.

Because DataFrame presents the data in a tabular format, it is very surprising 
that the table changes after going through hdfs. In any other tabular format 
that I know of, ordering of rows is a property of the data and it is surprising 
that reading/writing changes properties of the data. This is also a bit scary 
because if ordering were not a property of a DataFrame, can things like cache 
or select("col") change ordering of rows in the future? 


> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-31 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950979#comment-15950979
 ] 

Li Jin commented on SPARK-20144:


Thanks for getting back to me.

Sorting in this case will just add extra cost to our workflow, and we are 
trying to avoid it in the first place.

Because a DataFrame presents the data in a tabular format, it is very surprising 
that the table changes after going through HDFS. In any other tabular format 
that I know of, the ordering of rows is a property of the data, and it is surprising 
that reading/writing changes properties of the data. This is also a bit scary: 
if ordering is not a property of a DataFrame, could things like cache 
or select("col") change the ordering of rows in the future? 
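
For reference, a minimal sketch of one way to make the intended order explicit across the 
write/read boundary (the _row_id column name and the paths are hypothetical assumptions, 
not part of this issue's discussion):

{code:java}
import static org.apache.spark.sql.functions.monotonically_increasing_id;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrderedParquetRoundTrip {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("ordered-parquet").getOrCreate();

    // Stamp each row with an id that is monotonic in the original row order
    // before writing, so the order can be recovered later.
    Dataset<Row> df = spark.read().json("/tmp/input.json");  // hypothetical input
    df.withColumn("_row_id", monotonically_increasing_id())
      .write().parquet("/tmp/data.parquet");

    // After reading, restore the original order explicitly; the scan itself
    // may coalesce and reorder file splits.
    Dataset<Row> restored = spark.read().parquet("/tmp/data.parquet").orderBy("_row_id");
    restored.show();
  }
}
{code}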


> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20177) Document about compression way has some little detail changes.

2017-03-31 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20177:
---
Description: 
The documentation of the compression settings needs a few small fixes:
1. spark.eventLog.compress: add 'Compression will use spark.io.compression.codec.'
2. spark.broadcast.compress: add 'Compression will use spark.io.compression.codec.'
3. spark.rdd.compress: add 'Compression will use spark.io.compression.codec.'
4. spark.io.compression.codec: mention that it also applies to the event log.
For example, from the current documentation I cannot tell which compression codec the 
event log uses.

  was:
Document compression way little detail changes.
1.spark.eventLog.compress add 'Compression will use spark.io.compression.codec.'
2.spark.broadcast.compress add 'Compression will use 
spark.io.compression.codec.'
3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
4.spark.io.compression.codec add 'event log describe'


> Document about compression way has some little detail changes.
> --
>
> Key: SPARK-20177
> URL: https://issues.apache.org/jira/browse/SPARK-20177
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Document compression way little detail changes.
> 1.spark.eventLog.compress add 'Compression will use 
> spark.io.compression.codec.'
> 2.spark.broadcast.compress add 'Compression will use 
> spark.io.compression.codec.'
> 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
> 4.spark.io.compression.codec add 'event log describe'
> eg 
> Through the documents, I don't know  what is compression mode about 'event 
> log'.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20177) Document about compression way has some little detail changes.

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950939#comment-15950939
 ] 

Apache Spark commented on SPARK-20177:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/17498

> Document about compression way has some little detail changes.
> --
>
> Key: SPARK-20177
> URL: https://issues.apache.org/jira/browse/SPARK-20177
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Document compression way little detail changes.
> 1.spark.eventLog.compress add 'Compression will use 
> spark.io.compression.codec.'
> 2.spark.broadcast.compress add 'Compression will use 
> spark.io.compression.codec.'
> 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
> 4.spark.io.compression.codec add 'event log describe'



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20178) Improve Scheduler fetch failures

2017-03-31 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950917#comment-15950917
 ] 

Thomas Graves edited comment on SPARK-20178 at 3/31/17 1:53 PM:


Overall what I would like to accomplish is not throwing away work and making 
the failure case very performant. More and more people are running Spark on 
larger clusters, which means failures are going to occur more often.  We need those 
failures to be handled as fast as possible.  We need to be careful here and make sure 
we handle the node being totally down, the nodemanager being totally down, and the 
nodemanager or node just having an intermittent issue.  Generally I see the 
last case, where the issue is just intermittent, but some people recently have had 
more of the nodemanager-totally-down case, in which case you want to fail all 
maps on that node quickly.  The decision on what to rerun is hard now because 
it could be very costly to rerun more, but at the same time it could be very 
costly to not rerun everything immediately, because you can fail all 4 stage attempts.  
This really depends on how long the maps and reduces run.  A lot of the discussion 
on https://github.com/apache/spark/pull/17088 relates to that. 

- We should not kill the Reduce tasks on fetch failure.  Leave the Reduce tasks 
running since it could have done useful work already like fetching X number of 
map outputs.  It can simply fail that map output which would cause the map to 
be rerun and only that specific map output would need to be refetched.  This 
does require checking to make sure there are enough resources to run the map and, 
if not, possibly killing a reducer or getting more resources if dynamic 
allocation is enabled.
- Improve the logic around deciding which node is actually bad when you get a fetch 
failure.  Was it really the node the reduce was on, or the node the map was on? 
 You could do something here like tracking the % of reducers that failed to fetch from 
a given map output node.
- We should only rerun the maps that are necessary. Other maps could have 
already been fetched (with bullet one) so no need to rerun those immediately.  
Since the reduce tasks keep running, other fetch failures can happen in 
parallel and that would just cause other maps to be rerun.  At some point, based 
on bullet 2 above, we can decide the entire node is bad or invalidate all output 
on that node. Make sure to think about intermittent failures vs. the shuffle handler 
being totally down and not coming back, and use that in the determination logic.
- Improve the blacklisting based on the above improvements
- make sure to think about how this plays into the stage attempt max failures 
(4, now settable)
- try not to waste resources.  I.e. right now we can have 2 copies of the same reduce 
task running, which uses twice the resources, and there are a bunch of 
different conditions that determine whether this duplicated work is actually useful.


Question:
- should we consider having it fetch all map output from a host at once (rather 
than per executor)?  This could improve fetch times (we would have to test) 
as well as fetch failure handling. It could also cause more maps to be failed, which 
is somewhat contradictory to bullet 3 above; need to think about this more.
- Do we need a pluggable interface, and how do we avoid destabilizing the current scheduler?

Bonus or future:
- Make the decision on when and how many maps to rerun a cost-based estimate.  If maps 
only take a few seconds to run, we could rerun all maps on the host immediately.
- Option to prestart reduce tasks so that they can start fetching while the last 
few maps are failing (if you have long-tail maps).


was (Author: tgraves):
Overall what I would like to accomplish is not throwing away work and making 
the failure case very performant. More and more people are running spark on 
larger clusters, this means failures are going to occur more.  We need those 
failures to be as fast as possible.  We need to be careful here and make sure 
we handle the node totally down case, the nodemanager totally down, and the 
nodemanager or node is just having intermittent issue.  Generally I see the 
last where the issue is just intermittent but some people recently have had 
more of the nodemanager totally down case in which case you want to fail all 
maps on that node quickly.  The decision on what to rerun is hard now because 
it could be very costly to rerun more, but at the same time it could be very 
costly to not rerun all immediately because you can fail all 4 stage attempts.  
This really depends on how long the maps and reduces run.  A lot of discussion 
on https://github.com/apache/spark/pull/17088 related to that. 

- We should not kill the Reduce tasks on fetch failure.  Leave the Reduce tasks 
running since it could have done useful work already like fetching X number of 
map outputs.  It can simply fail that map output which would cause the map to 
be rerun and only that specific map output would need to be 

[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-03-31 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950917#comment-15950917
 ] 

Thomas Graves commented on SPARK-20178:
---

Overall what I would like to accomplish is not throwing away work and making 
the failure case very performant. More and more people are running Spark on 
larger clusters, which means failures are going to occur more often.  We need those 
failures to be handled as fast as possible.  We need to be careful here and make sure 
we handle the node being totally down, the nodemanager being totally down, and the 
nodemanager or node just having an intermittent issue.  Generally I see the 
last case, where the issue is just intermittent, but some people recently have had 
more of the nodemanager-totally-down case, in which case you want to fail all 
maps on that node quickly.  The decision on what to rerun is hard now because 
it could be very costly to rerun more, but at the same time it could be very 
costly to not rerun everything immediately, because you can fail all 4 stage attempts.  
This really depends on how long the maps and reduces run.  A lot of the discussion 
on https://github.com/apache/spark/pull/17088 relates to that. 

- We should not kill the Reduce tasks on fetch failure.  Leave the Reduce tasks 
running since it could have done useful work already like fetching X number of 
map outputs.  It can simply fail that map output which would cause the map to 
be rerun and only that specific map output would need to be refetched.  This 
does require checking to make sure there are enough resources to run the map and, 
if not, possibly killing a reducer or getting more resources if dynamic 
allocation is enabled.
- Improve the logic around deciding which node is actually bad when you get a fetch 
failure.  Was it really the node the reduce was on, or the node the map was on? 
 You could do something here like tracking the % of reducers that failed to fetch from 
a given map output node.
- We should only rerun the maps that failed (or have some logic around how to 
make this decision); other maps could have already been fetched (with bullet one), 
so there is no need to rerun them if all reducers already fetched them.  Since the 
reduce tasks keep running, other fetch failures can happen in parallel and that would 
just cause other maps to be rerun.  At some point, based on bullet 2 above, we can 
decide the entire node is bad.
- Improve the blacklisting based on the above improvements
- make sure to think about how this plays into the stage attempt max failures 
(4, now settable; see the config sketch at the end of this comment)
- try not to waste resources.  I.e. right now we can have 2 copies of the same reduce 
task running, which uses twice the resources, and there are a bunch of 
different conditions that determine whether this duplicated work is actually useful.

Question:
- should we consider having it fetch all map output from a host at once (rather 
than per executor)?  This could improve fetch times (we would have to test) 
as well as fetch failure handling. It could also cause more maps to be failed, which 
is somewhat contradictory to bullet 3 above; need to think about this more.
- Do we need a pluggable interface, and how do we avoid destabilizing the current scheduler?

Bonus or future:
- Make the decision on when and how many maps to rerun a cost-based estimate.  If maps 
only take a few seconds to run, we could rerun all maps on the host immediately.
- Option to prestart reduce tasks so that they can start fetching while the last 
few maps are failing (if you have long-tail maps).
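
As a concrete reference for the stage-attempt limit mentioned above, a minimal sketch 
(assuming spark.stage.maxConsecutiveAttempts is the setting meant by "4, now settable"; 
the value below is illustrative only):

{code:java}
import org.apache.spark.SparkConf;

public class StageAttemptLimitSketch {
  public static void main(String[] args) {
    // Assumption: spark.stage.maxConsecutiveAttempts is the "4, now settable" limit;
    // raising it trades faster failure for more tolerance of repeated fetch failures.
    SparkConf conf = new SparkConf()
      .setAppName("fetch-failure-tuning-sketch")
      .set("spark.stage.maxConsecutiveAttempts", "8");
  }
}
{code}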

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 jira currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19443) The function to generate constraints takes too long when the query plan grows continuously

2017-03-31 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-19443.
---
Resolution: Won't Fix

> The function to generate constraints takes too long when the query plan grows 
> continuously
> --
>
> Key: SPARK-19443
> URL: https://issues.apache.org/jira/browse/SPARK-19443
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> This issue is originally reported and discussed at 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html
> When run a ML `Pipeline` with many stages, during the iterative updating to 
> `Dataset` , it is observed the it takes longer time to finish the fit and 
> transform as the query plan grows continuously.
> Specially, the time spent on preparing optimized plan in current branch 
> (74294 ms) is much higher than 1.6 (292 ms). Actually, the time is spent 
> mostly on generating query plan's constraints during few optimization rules.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19665) Improve constraint propagation

2017-03-31 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-19665.
---
Resolution: Won't Fix

> Improve constraint propagation
> --
>
> Key: SPARK-19665
> URL: https://issues.apache.org/jira/browse/SPARK-19665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> If there are aliased expression in the projection, we propagate constraints 
> by completely expanding the original constraints with aliases.
> This expanding costs much computation time when the number of aliases 
> increases.
> Another issue is we actually don't need the additional constraints at most of 
> time. For example, if there is a constraint "a > b", and "a" is aliased to 
> "c" and "d". When we use this constraint in filtering, we don't need all 
> constraints "a > b", "c > b", "d > b". We only need "a > b" because if it is 
> false, it is guaranteed that all other constraints are false too.
> Fully expanding all constraints at all the time makes iterative ML algorithms 
> where a ML pipeline with many stages runs very slow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20178) Improve Scheduler fetch failures

2017-03-31 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20178:
-

 Summary: Improve Scheduler fetch failures
 Key: SPARK-20178
 URL: https://issues.apache.org/jira/browse/SPARK-20178
 Project: Spark
  Issue Type: Epic
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Thomas Graves


We have been having a lot of discussions around improving the handling of fetch 
failures.  There are 4 JIRAs currently related to this.  

We should try to get a list of things we want to improve and come up with one 
cohesive design.

SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753

I will put my initial thoughts in a follow-on comment.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20177) Document about compression way has some little detail changes.

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20177:


Assignee: Apache Spark

> Document about compression way has some little detail changes.
> --
>
> Key: SPARK-20177
> URL: https://issues.apache.org/jira/browse/SPARK-20177
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>
> Document compression way little detail changes.
> 1.spark.eventLog.compress add 'Compression will use 
> spark.io.compression.codec.'
> 2.spark.broadcast.compress add 'Compression will use 
> spark.io.compression.codec.'
> 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
> 4.spark.io.compression.codec add 'event log describe'



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20177) Document about compression way has some little detail changes.

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20177:


Assignee: (was: Apache Spark)

> Document about compression way has some little detail changes.
> --
>
> Key: SPARK-20177
> URL: https://issues.apache.org/jira/browse/SPARK-20177
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Document compression way little detail changes.
> 1.spark.eventLog.compress add 'Compression will use 
> spark.io.compression.codec.'
> 2.spark.broadcast.compress add 'Compression will use 
> spark.io.compression.codec.'
> 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
> 4.spark.io.compression.codec add 'event log describe'



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20177) Document about compression way has some little detail changes.

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950826#comment-15950826
 ] 

Apache Spark commented on SPARK-20177:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/17497

> Document about compression way has some little detail changes.
> --
>
> Key: SPARK-20177
> URL: https://issues.apache.org/jira/browse/SPARK-20177
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Document compression way little detail changes.
> 1.spark.eventLog.compress add 'Compression will use 
> spark.io.compression.codec.'
> 2.spark.broadcast.compress add 'Compression will use 
> spark.io.compression.codec.'
> 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
> 4.spark.io.compression.codec add 'event log describe'



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20177) Document about compression way has some little detail changes.

2017-03-31 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-20177:
--

 Summary: Document about compression way has some little detail 
changes.
 Key: SPARK-20177
 URL: https://issues.apache.org/jira/browse/SPARK-20177
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.1.0
Reporter: guoxiaolongzte
Priority: Minor


The documentation of the compression settings needs a few small fixes (a configuration 
sketch showing how these settings fit together follows below):
1. spark.eventLog.compress: add 'Compression will use spark.io.compression.codec.'
2. spark.broadcast.compress: add 'Compression will use spark.io.compression.codec.'
3. spark.rdd.compress: add 'Compression will use spark.io.compression.codec.'
4. spark.io.compression.codec: mention that it also applies to the event log.
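
A sketch of how these settings fit together (illustrative values only; not part of the 
proposed doc change):

{code:java}
import org.apache.spark.SparkConf;

public class CompressionConfigSketch {
  public static void main(String[] args) {
    // The three *.compress switches all delegate the choice of codec
    // to spark.io.compression.codec.
    SparkConf conf = new SparkConf()
      .setAppName("compression-config-sketch")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.compress", "true")     // event log compression
      .set("spark.broadcast.compress", "true")    // broadcast variable compression
      .set("spark.rdd.compress", "true")          // serialized RDD partition compression
      .set("spark.io.compression.codec", "lz4");  // codec used by all of the above
  }
}
{code}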



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2017-03-31 Thread Sunil Rangwani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950778#comment-15950778
 ] 

Sunil Rangwani edited comment on SPARK-14492 at 3/31/17 12:27 PM:
--

My problem was exactly (a): interacting with a Hive metastore of an older version. 
I set it up with the various spark.sql.hive.metastore.* config options, but that 
didn't work. 
I had to do a messy upgrade of the external Hive metastore database and service 
to get it to work. 


was (Author: sunil.rangwani):
My problem exactly was a) Interacting with Hive metastore of an older version. 
I set it up with the various config options spark.sql.hive.metastore.* config 
options but that didn't work. 
I had to do a messy upgrade of the external hive metastore database and service 
to get it to work. 

> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not 
> backwards compatible with earlier version
> ---
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2017-03-31 Thread Sunil Rangwani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950778#comment-15950778
 ] 

Sunil Rangwani commented on SPARK-14492:


My problem was exactly (a): interacting with a Hive metastore of an older version. 
I set it up with the various spark.sql.hive.metastore.* config 
options, but that didn't work. 
I had to do a messy upgrade of the external Hive metastore database and service 
to get it to work. 

> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not 
> backwards compatible with earlier version
> ---
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2017-03-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950768#comment-15950768
 ] 

Sean Owen commented on SPARK-14492:
---

You are still describing two different things, I think: a) interacting with Hive 
metastore X, and b) building Spark with Hive X. a) should work as documented. 
What you describe in this JIRA is b), though. You do not need to, and in 
fact cannot, build Spark against an older Hive.
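
For case a), a minimal sketch of the documented route on the 1.6 line (the metastore 
version and jar source below are assumptions for illustration, not a verified 
configuration):

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class OlderMetastoreSketch {
  public static void main(String[] args) {
    // Point Spark SQL at an older external metastore via the documented options.
    SparkConf conf = new SparkConf()
      .setAppName("older-metastore-sketch")
      .set("spark.sql.hive.metastore.version", "0.13.1")
      .set("spark.sql.hive.metastore.jars", "maven");  // or a classpath with matching Hive jars

    JavaSparkContext sc = new JavaSparkContext(conf);
    HiveContext hive = new HiveContext(sc.sc());
    hive.sql("SHOW TABLES").show();
  }
}
{code}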

> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not 
> backwards compatible with earlier version
> ---
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2017-03-31 Thread Sunil Rangwani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950760#comment-15950760
 ] 

Sunil Rangwani commented on SPARK-14492:


[~sowen] Can you please explain why it is not a problem? Interacting with a 
different version of the Hive metastore doesn't work as described in the 
documentation. I have met other people who have the same use case; they have 
legacy data in Hive and want to use Spark to interact with it. 

> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not 
> backwards compatible with earlier version
> ---
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18936) Infrastructure for session local timezone support

2017-03-31 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950748#comment-15950748
 ] 

Navya Krishnappa edited comment on SPARK-18936 at 3/31/17 11:51 AM:


I think this fix lets us set the time zone in the Spark configuration. If 
so, can we set "UTC" as the time zone?

Let me know if I have misunderstood the document.


was (Author: navya krishnappa):
I think this fix helps us to set the time zone in the spark configurations. If 
it's so Can we set "UTC" time zone??

And let me know if I misunderstood the document.

> Infrastructure for session local timezone support
> -
>
> Key: SPARK-18936
> URL: https://issues.apache.org/jira/browse/SPARK-18936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18936) Infrastructure for session local timezone support

2017-03-31 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950748#comment-15950748
 ] 

Navya Krishnappa commented on SPARK-18936:
--

I think this fix lets us set the time zone in the Spark configuration. If 
so, can we set a "UTC" time zone?

Let me know if I have misunderstood the document.
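
A minimal sketch of what that could look like, assuming the session-local timezone is 
exposed as the spark.sql.session.timeZone option in 2.2:

{code:java}
import org.apache.spark.sql.SparkSession;

public class SessionTimeZoneSketch {
  public static void main(String[] args) {
    // Set the session timezone to UTC when the session is created ...
    SparkSession spark = SparkSession.builder()
      .appName("utc-session-sketch")
      .config("spark.sql.session.timeZone", "UTC")
      .getOrCreate();

    // ... or switch it for the current session at runtime.
    spark.conf().set("spark.sql.session.timeZone", "UTC");
  }
}
{code}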

> Infrastructure for session local timezone support
> -
>
> Key: SPARK-18936
> URL: https://issues.apache.org/jira/browse/SPARK-18936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20152) Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ"

2017-03-31 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950745#comment-15950745
 ] 

Navya Krishnappa commented on SPARK-20152:
--

[~srowen] & [~hyukjin.kwon] Thank you for your comments. 

> Time zone is not respected while parsing csv for timeStampFormat 
> "MM-dd-'T'HH:mm:ss.SSSZZ"
> --
>
> Key: SPARK-20152
> URL: https://issues.apache.org/jira/browse/SPARK-20152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the below mentioned time value by specifying the 
> "timestampFormat": "MM-dd-'T'HH:mm:ss.SSSZZ", time zone is ignored.
> Source File: 
> TimeColumn
> 03-21-2017T03:30:02Z
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(ESCAPE, "\\")
> .option("timestampFormat" , "MM-dd-'T'HH:mm:ss.SSSZZ")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> Result: TimeColumn [ StringType] and value is "03-21-2017T03:30:02Z", but 
> expected result is TimeCoumn should be of "TimestampType"  and should 
> consider time zone for manipulation
> Source code2: 
> Dataset dataset = getSqlContext().read() 
> .option(PARSER_LIB, "commons") 
> .option(INFER_SCHEMA, "true") 
> .option(DELIMITER, ",") 
> .option(QUOTE, "\"") 
> .option(ESCAPE, "\\") 
> .option("timestampFormat" , "MM-dd-'T'HH:mm:ss") 
> .option(MODE, Mode.PERMISSIVE) 
> .csv(sourceFile); 
> Result: TimeColumn [ TimestampType] and value is "2017-03-21 03:30:02.0", but 
> expected result is TimeCoumn should consider time zone for manipulation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20176) Spark Dataframe UDAF issue

2017-03-31 Thread Dinesh Man Amatya (JIRA)
Dinesh Man Amatya created SPARK-20176:
-

 Summary: Spark Dataframe UDAF issue
 Key: SPARK-20176
 URL: https://issues.apache.org/jira/browse/SPARK-20176
 Project: Spark
  Issue Type: IT Help
  Components: Spark Core
Affects Versions: 2.0.2
Reporter: Dinesh Man Amatya


Getting the following error in a custom UDAF:

Error while decoding: java.util.concurrent.ExecutionException: 
java.lang.Exception: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 33: Incompatible expression types "boolean" and "java.lang.Boolean"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private MutableRow mutableRow;
/* 009 */   private Object[] values;
/* 010 */   private Object[] values1;
/* 011 */   private org.apache.spark.sql.types.StructType schema;
/* 012 */   private org.apache.spark.sql.types.StructType schema1;
/* 013 */
/* 014 */
/* 015 */   public SpecificSafeProjection(Object[] references) {
/* 016 */ this.references = references;
/* 017 */ mutableRow = (MutableRow) references[references.length - 1];
/* 018 */
/* 019 */
/* 020 */ this.schema = (org.apache.spark.sql.types.StructType) 
references[0];
/* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) 
references[1];
/* 022 */   }
/* 023 */
/* 024 */   public java.lang.Object apply(java.lang.Object _i) {
/* 025 */ InternalRow i = (InternalRow) _i;
/* 026 */
/* 027 */ values = new Object[2];
/* 028 */
/* 029 */ boolean isNull2 = i.isNullAt(0);
/* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
/* 031 */
/* 032 */ boolean isNull1 = isNull2;
/* 033 */ final java.lang.String value1 = isNull1 ? null : 
(java.lang.String) value2.toString();
/* 034 */ isNull1 = value1 == null;
/* 035 */ if (isNull1) {
/* 036 */   values[0] = null;
/* 037 */ } else {
/* 038 */   values[0] = value1;
/* 039 */ }
/* 040 */
/* 041 */ boolean isNull5 = i.isNullAt(1);
/* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2));
/* 043 */ boolean isNull3 = false;
/* 044 */ org.apache.spark.sql.Row value3 = null;
/* 045 */ if (!false && isNull5) {
/* 046 */
/* 047 */   final org.apache.spark.sql.Row value6 = null;
/* 048 */   isNull3 = true;
/* 049 */   value3 = value6;
/* 050 */ } else {
/* 051 */
/* 052 */   values1 = new Object[2];
/* 053 */
/* 054 */   boolean isNull10 = i.isNullAt(1);
/* 055 */   InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2));
/* 056 */
/* 057 */   boolean isNull9 = isNull10 || false;
/* 058 */   final boolean value9 = isNull9 ? false : (Boolean) 
value10.isNullAt(0);
/* 059 */   boolean isNull8 = false;
/* 060 */   double value8 = -1.0;
/* 061 */   if (!isNull9 && value9) {
/* 062 */
/* 063 */ final double value12 = -1.0;
/* 064 */ isNull8 = true;
/* 065 */ value8 = value12;
/* 066 */   } else {
/* 067 */
/* 068 */ boolean isNull14 = i.isNullAt(1);
/* 069 */ InternalRow value14 = isNull14 ? null : (i.getStruct(1, 2));
/* 070 */ boolean isNull13 = isNull14;
/* 071 */ double value13 = -1.0;
/* 072 */
/* 073 */ if (!isNull14) {
/* 074 */
/* 075 */   if (value14.isNullAt(0)) {
/* 076 */ isNull13 = true;
/* 077 */   } else {
/* 078 */ value13 = value14.getDouble(0);
/* 079 */   }
/* 080 */
/* 081 */ }
/* 082 */ isNull8 = isNull13;
/* 083 */ value8 = value13;
/* 084 */   }
/* 085 */   if (isNull8) {
/* 086 */ values1[0] = null;
/* 087 */   } else {
/* 088 */ values1[0] = value8;
/* 089 */   }
/* 090 */
/* 091 */   boolean isNull17 = i.isNullAt(1);
/* 092 */   InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2));
/* 093 */
/* 094 */   boolean isNull16 = isNull17 || false;
/* 095 */   final boolean value16 = isNull16 ? false : (Boolean) 
value17.isNullAt(1);
/* 096 */   boolean isNull15 = false;
/* 097 */   double value15 = -1.0;
/* 098 */   if (!isNull16 && value16) {
/* 099 */
/* 100 */ final double value19 = -1.0;
/* 101 */ isNull15 = true;
/* 102 */ value15 = value19;
/* 103 */   } else {
/* 104 */
/* 105 */ boolean isNull21 = i.isNullAt(1);
/* 106 */ InternalRow value21 = isNull21 ? null : (i.getStruct(1, 2));
/* 107 */ boolean isNull20 = isNull21;
/* 108 */ double value20 = -1.0;
/* 109 */
/* 110 */ if (!isNull21) {
/* 111 */
/* 112 */   if (value21.isNullAt(1)) {
/* 113 */
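
For readers unfamiliar with the API behind the generated code above, a custom UDAF on 
the 2.0 line is typically written against UserDefinedAggregateFunction; below is a 
minimal hypothetical sketch (a plain sum of doubles, not the reporter's aggregate):

{code:java}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical sum-of-doubles aggregate, for illustration only.
public class SimpleSumUDAF extends UserDefinedAggregateFunction {

  @Override
  public StructType inputSchema() {
    return new StructType().add("value", DataTypes.DoubleType);
  }

  @Override
  public StructType bufferSchema() {
    return new StructType().add("total", DataTypes.DoubleType);
  }

  @Override
  public DataType dataType() {
    return DataTypes.DoubleType;
  }

  @Override
  public boolean deterministic() {
    return true;
  }

  @Override
  public void initialize(MutableAggregationBuffer buffer) {
    buffer.update(0, 0.0);  // running total starts at zero
  }

  @Override
  public void update(MutableAggregationBuffer buffer, Row input) {
    if (!input.isNullAt(0)) {
      buffer.update(0, buffer.getDouble(0) + input.getDouble(0));
    }
  }

  @Override
  public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
    buffer1.update(0, buffer1.getDouble(0) + buffer2.getDouble(0));
  }

  @Override
  public Object evaluate(Row buffer) {
    return buffer.getDouble(0);
  }
}
{code}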

[jira] [Commented] (SPARK-20173) Throw NullPointerException when HiveThriftServer2 is shutdown

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950683#comment-15950683
 ] 

Apache Spark commented on SPARK-20173:
--

User 'zuotingbing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17496

> Throw NullPointerException when HiveThriftServer2 is shutdown
> -
>
> Key: SPARK-20173
> URL: https://issues.apache.org/jira/browse/SPARK-20173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>
> Throw NullPointerException when HiveThriftServer2 is shutdown:
> 
> 2017-03-30 11:52:56,355 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$$anonfun$main$1.apply$mcV$sp(HiveThriftServer2.scala:85)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:215)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:187)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:187)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:187)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:177)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 2017-03-30 11:52:56,357 INFO ShutdownHookManager: Shutdown hook called



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished

2017-03-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950640#comment-15950640
 ] 

Sean Owen commented on SPARK-20139:
---

So is the lesson here that the driver can't keep up at this scale with all of 
the event messages -- is it just cosmetic?
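
If it is just the listener event queue overflowing, one mitigation to try is enlarging 
it; a minimal sketch, under the assumption that spark.scheduler.listenerbus.eventqueue.size 
is the relevant knob in 2.1 (default 10000), with an illustrative value:

{code:java}
import org.apache.spark.SparkConf;

public class ListenerQueueSketch {
  public static void main(String[] args) {
    // Larger queue means fewer dropped SparkListenerEvents at this scale,
    // at the cost of more driver memory.
    SparkConf conf = new SparkConf()
      .setAppName("listener-queue-tuning-sketch")
      .set("spark.scheduler.listenerbus.eventqueue.size", "100000");
  }
}
{code}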

> Spark UI reports partial success for completed stage while log shows all 
> tasks are finished
> ---
>
> Key: SPARK-20139
> URL: https://issues.apache.org/jira/browse/SPARK-20139
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Etti Gur
> Attachments: screenshot-1.png
>
>
> Spark UI reports partial success for completed stage while log shows all 
> tasks are finished - i.e.:
> We have a stage that is presented under completed stages on spark UI,
> but the successful tasks are shown like so: (146372/524964) not as you'd 
> expect (524964/524964)
> Looking at the application master log shows all tasks in that stage are 
> successful:
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 
> (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) 
> (524963/524964)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 
> (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) 
> (20234/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 
> (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) 
> (20235/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 
> (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) 
> (20236/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 
> (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) 
> (20237/20262)
> 17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 
> (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) 
> (20238/20262)
> 17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 
> (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) 
> (524964/524964)
> Also in the log we get an error:
> 17/03/29 08:24:16 ERROR LiveListenerBus: Dropping SparkListenerEvent because 
> no remaining room in event queue. This likely means one of the SparkListeners 
> is too slow and cannot keep up with the rate at which tasks are being started 
> by the scheduler.
> This looks like the stage is indeed completed with all its tasks but UI shows 
> like not all tasks really finished.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2017-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14492.
---
Resolution: Not A Problem

> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not 
> backwards compatible with earlier version
> ---
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-31 Thread guoxiaolong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolong updated SPARK-19862:

Comment: was deleted

(was: @srowen
In spark2.1.0,"tungsten-sort" -> 
classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName has been 
deleted,but you didn't agree with my issue SPARK-19862.why?)

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: guoxiaolong
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted. Because it is the same of "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.
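For context, a sketch of the mapping being discussed, approximating SparkEnv.scala in the 2.0.x/2.1.x line (SortShuffleManager is private[spark], so this is Spark-internal code rather than application code):

{code:scala}
// Both short names resolve to the same shuffle manager class, which is why the
// report argues the "tungsten-sort" entry is redundant.
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
{code}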






[jira] [Updated] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work

2017-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19690:
--
Target Version/s: 2.2.0  (was: 2.1.1, 2.2.0)

> Join a streaming DataFrame with a batch DataFrame may not work
> --
>
> Key: SPARK-19690
> URL: https://issues.apache.org/jira/browse/SPARK-19690
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.3, 2.1.0, 2.1.1
>Reporter: Shixiong Zhu
>Priority: Critical
>
> When joining a streaming DataFrame with a batch DataFrame, if the batch 
> DataFrame contains an aggregation, that aggregation is converted to a streaming 
> physical aggregation and the query then crashes.
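A hedged reproduction sketch, written spark-shell style; the socket source, host/port, and column names are illustrative and not taken from the report:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-batch-join-repro").getOrCreate()

// Batch side that contains an aggregation.
val batchCounts = spark.range(100)
  .selectExpr("id % 10 AS key")
  .groupBy("key")
  .count()

// Streaming side; any streaming source works for the purpose of the sketch.
val streamed = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .selectExpr("CAST(value AS INT) % 10 AS key")

// In the affected versions the batch aggregation can be planned as a streaming
// aggregation once it is joined with the streaming side, and the query crashes.
val joined = streamed.join(batchCounts, "key")
joined.writeStream.format("console").outputMode("append").start().awaitTermination()
{code}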






[jira] [Resolved] (SPARK-20167) In SqlBase.g4, some of the comments are not correct.

2017-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20167.
---
Resolution: Not A Problem

> In SqlBase.g4, some of the comments are not correct.
> --
>
> Key: SPARK-20167
> URL: https://issues.apache.org/jira/browse/SPARK-20167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> In SqlBase.g4, some of the comments are not correct.
> For example:
>   | DROP TABLE (IF EXISTS)? tableIdentifier PURGE?   #dropTable
>   | DROP VIEW (IF EXISTS)? tableIdentifier           #dropTable
> The comment on the ‘DROP VIEW (IF EXISTS)? tableIdentifier’ alternative should 
> be #dropView.






[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data

2017-03-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950627#comment-15950627
 ] 

Sean Owen commented on SPARK-20144:
---

If you need a particular ordering, I think you need to sort. I am not sure 
ordering is particularly guaranteed in the format or the reading of it.
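For example, a sketch of making the required order explicit after reading; the path and ordering column below are hypothetical:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ordered-parquet-read").getOrCreate()

// Row order coming back from spark.read.parquet is not guaranteed, so carry an
// explicit ordering column in the data and sort on it after reading.
val df = spark.read.parquet("/path/to/data.parquet")   // hypothetical path
val ordered = df.orderBy("event_time")                 // hypothetical ordering column
ordered.show()
{code}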

> spark.read.parquet no longer maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read Parquet files in 2.0.2, the ordering of rows in the resulting 
> DataFrame is not the same as the ordering of rows in the DataFrame that the 
> Parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the Parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume a particular ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue in 2.1.






[jira] [Assigned] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20175:


Assignee: Apache Spark

> Exists should not be evaluated in Join operator and can be converted to 
> ScalarSubquery if no correlated reference
> -
>
> Key: SPARK-20175
> URL: https://issues.apache.org/jira/browse/SPARK-20175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Similar to ListQuery, Exists should not be evaluated in the Join operator 
> either. Otherwise, a query like the following will fail:
> sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR 
> l.a = r.c)")
> For an Exists subquery without a correlated reference, this patch converts it 
> to a scalar subquery with a count Aggregate operator.






[jira] [Assigned] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20175:


Assignee: (was: Apache Spark)

> Exists should not be evaluated in Join operator and can be converted to 
> ScalarSubquery if no correlated reference
> -
>
> Key: SPARK-20175
> URL: https://issues.apache.org/jira/browse/SPARK-20175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> Similar to ListQuery, Exists should not be evaluated in the Join operator 
> either. Otherwise, a query like the following will fail:
> sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR 
> l.a = r.c)")
> For an Exists subquery without a correlated reference, this patch converts it 
> to a scalar subquery with a count Aggregate operator.






[jira] [Commented] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950624#comment-15950624
 ] 

Apache Spark commented on SPARK-20175:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17491

> Exists should not be evaluated in Join operator and can be converted to 
> ScalarSubquery if no correlated reference
> -
>
> Key: SPARK-20175
> URL: https://issues.apache.org/jira/browse/SPARK-20175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> Similar to ListQuery, Exists should not be evaluated in the Join operator 
> either. Otherwise, a query like the following will fail:
> sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR 
> l.a = r.c)")
> For an Exists subquery without a correlated reference, this patch converts it 
> to a scalar subquery with a count Aggregate operator.






[jira] [Created] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference

2017-03-31 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-20175:
---

 Summary: Exists should not be evaluated in Join operator and can 
be converted to ScalarSubquery if no correlated reference
 Key: SPARK-20175
 URL: https://issues.apache.org/jira/browse/SPARK-20175
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Liang-Chi Hsieh


Similar to ListQuery, Exists should not be evaluated in the Join operator either. 
Otherwise, a query like the following will fail:

sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR 
l.a = r.c)")

For an Exists subquery without a correlated reference, this patch converts it to 
a scalar subquery with a count Aggregate operator.
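A reproduction sketch, spark-shell style, under assumed single-column schemas for l and r (only l.a and r.c are implied by the query above):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exists-join-repro").getOrCreate()

// Assumed schemas; only columns a and c appear in the query.
spark.range(10).selectExpr("id AS a").createOrReplaceTempView("l")
spark.range(10).selectExpr("id AS c").createOrReplaceTempView("r")

// Before the fix the uncorrelated EXISTS is evaluated inside the Join operator
// and the query fails; after the fix it is planned as a scalar subquery with a
// count aggregate.
spark.sql(
  """select * from l, r
    |where l.a = r.c + 1 and (exists (select * from r) or l.a = r.c)""".stripMargin
).show()
{code}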






[jira] [Commented] (SPARK-20173) Throw NullPointerException when HiveThriftServer2 is shut down

2017-03-31 Thread Xiaochen Ouyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950619#comment-15950619
 ] 

Xiaochen Ouyang commented on SPARK-20173:
-

+1

> Throw NullPointerException when HiveThriftServer2 is shut down
> -
>
> Key: SPARK-20173
> URL: https://issues.apache.org/jira/browse/SPARK-20173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>
> A NullPointerException is thrown when HiveThriftServer2 is shut down:
> 
> 2017-03-30 11:52:56,355 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
>   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$$anonfun$main$1.apply$mcV$sp(HiveThriftServer2.scala:85)
>   at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:215)
>   at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
>   at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187)
>   at scala.util.Try$.apply(Try.scala:192)
>   at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:187)
>   at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:177)
>   at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 2017-03-30 11:52:56,357 INFO ShutdownHookManager: Shutdown hook called
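For illustration only, a hedged sketch of the defensive pattern such a fix usually takes; the names below are hypothetical and the real code at HiveThriftServer2.scala:85 is not reproduced here:

{code:scala}
// Hypothetical sketch: hold possibly-uninitialized state as an Option so the
// shutdown hook cannot hit a NullPointerException when startup failed early.
object ShutdownGuardSketch {
  @volatile private var uiTab: Option[AnyRef] = None   // hypothetical field

  def install(): Unit = {
    sys.addShutdownHook {
      // A no-op when the tab was never created, instead of dereferencing null.
      uiTab.foreach(tab => println(s"detaching $tab"))
    }
  }
}
{code}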






[jira] [Assigned] (SPARK-20172) Event log without read permission should be filtered out before actually reading it

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20172:


Assignee: (was: Apache Spark)

> Event log without read permission should be filtered out before actually 
> reading it
> ---
>
> Key: SPARK-20172
> URL: https://issues.apache.org/jira/browse/SPARK-20172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark HistoryServer, the intent is to check file permissions 
> when listing the event log files and to filter out files without read 
> permission. That does not work, because the access permission is never 
> actually checked; the check is deferred until the files are read, which is 
> unnecessary and causes the exception to be printed every 10 seconds by default.
> So to avoid this problem, an access check should be added to the file-listing 
> logic.
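A sketch of such a listing-time check; this is not the actual patch, the helper name is made up, and FileSystem.access requires Hadoop 2.6+:

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.AccessControlException

// Hypothetical helper: keep only event logs the history server user can read,
// so unreadable files are skipped at listing time instead of failing at replay.
def readableLogs(fs: FileSystem, logDir: String): Seq[FileStatus] = {
  fs.listStatus(new Path(logDir)).toSeq.filter { status =>
    try {
      fs.access(status.getPath, FsAction.READ)   // throws if not readable
      true
    } catch {
      case _: AccessControlException => false
    }
  }
}
{code}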






[jira] [Assigned] (SPARK-20172) Event log without read permission should be filtered out before actually reading it

2017-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20172:


Assignee: Apache Spark

> Event log without read permission should be filtered out before actually 
> reading it
> ---
>
> Key: SPARK-20172
> URL: https://issues.apache.org/jira/browse/SPARK-20172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> In the current Spark HistoryServer, the intent is to check file permissions 
> when listing the event log files and to filter out files without read 
> permission. That does not work, because the access permission is never 
> actually checked; the check is deferred until the files are read, which is 
> unnecessary and causes the exception to be printed every 10 seconds by default.
> So to avoid this problem, an access check should be added to the file-listing 
> logic.






[jira] [Commented] (SPARK-20172) Event log without read permission should be filtered out before actually reading it

2017-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950616#comment-15950616
 ] 

Apache Spark commented on SPARK-20172:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17495

> Event log without read permission should be filtered out before actually 
> reading it
> ---
>
> Key: SPARK-20172
> URL: https://issues.apache.org/jira/browse/SPARK-20172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark HistoryServer, the intent is to check file permissions 
> when listing the event log files and to filter out files without read 
> permission. That does not work, because the access permission is never 
> actually checked; the check is deferred until the files are read, which is 
> unnecessary and causes the exception to be printed every 10 seconds by default.
> So to avoid this problem, an access check should be added to the file-listing 
> logic.





