[jira] [Commented] (SPARK-19148) do not expose the external table concept in Catalog
[ https://issues.apache.org/jira/browse/SPARK-19148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952057#comment-15952057 ]

Apache Spark commented on SPARK-19148:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17502

> do not expose the external table concept in Catalog
> ---
>
> Key: SPARK-19148
> URL: https://issues.apache.org/jira/browse/SPARK-19148
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Fix For: 2.2.0

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951685#comment-15951685 ]

Joseph K. Bradley edited comment on SPARK-9478 at 4/1/17 2:35 AM:
------------------------------------------------------------------

[~clamus] The current vote is to *not use* weights during sampling and then to *use* weights when growing the trees. That will simplify the sampling process, so we hopefully won't have to deal with the complexity you're mentioning. Note that we'll have to weight the trees in the forest to make this approach work. I'm also guessing that it will give better-calibrated probability estimates in the final forest, though this is based on intuition rather than analysis.

E.g., given the 4-instance dataset in [~sethah]'s example above, with subsampling of 4 instances for each tree, I'd imagine:

* If we use weights during sampling but not when growing trees...
** Say we want 10 trees. We pick 10 sets of 4 rows. The probability of always picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... (current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5. This means we'll almost surely have at least one tree with the weight-1000 row, so it will dominate our predictions (giving good accuracy).
** The probability of having no tree with only weight-1 rows is ~0.02, so it's pretty likely we'll have some tree predicting label1, which will keep our probability predictions away from 0 and 1.

This is really hand-wavy, but it does alleviate my fears of extreme log losses. On the other hand, maybe it could be handled by adding smoothing to predictions...

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.1
> Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample
> (instance) weights. Weights are important when there is imbalanced training
> data or when the evaluation metric of a classifier is imbalanced (e.g. true
> positive rate at some false positive threshold). Sample weights generalize
> class weights, so this could be used to add class weights later on.
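The back-of-the-envelope numbers in the comment above can be checked directly. A minimal sketch, under the assumption (hypothetical reading of [~sethah]'s example) that the dataset is 4 instances with weights [1000, 1, 1, 1], 10 trees, and 4 bootstrap draws with replacement per tree:

```python
# Check the back-of-the-envelope probabilities from the comment above.
# Assumed setup (a guess at [~sethah]'s example, not quoted from it):
# 4 instances with weights [1000, 1, 1, 1], 10 trees, 4 draws per tree.
n_trees, draws_per_tree = 10, 4
total_draws = n_trees * draws_per_tree  # 40 draws overall

# Weighted sampling: each draw picks the weight-1000 row w.p. 1000/1003.
p_heavy = 1000 / 1003
p_all_heavy = p_heavy ** total_draws                 # ~0.89

# Unweighted sampling: each draw misses the weight-1000 row w.p. 3/4.
p_all_light = (3 / 4) ** total_draws                 # ~1e-5
p_tree_all_light = (3 / 4) ** draws_per_tree         # ~0.32 for one tree
p_no_such_tree = (1 - p_tree_all_light) ** n_trees   # ~0.02
# So with probability ~0.98 at least one tree sees only weight-1 rows.

print(p_all_heavy, p_all_light, p_no_such_tree)
```

All three figures match the comment's estimates under these assumptions.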
[jira] [Assigned] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights
[ https://issues.apache.org/jira/browse/SPARK-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20183:
------------------------------------

Assignee: Seth Hendrickson  (was: Apache Spark)

> Add outlierRatio option to testOutliersWithSmallWeights
> ---
>
> Key: SPARK-20183
> URL: https://issues.apache.org/jira/browse/SPARK-20183
> Project: Spark
> Issue Type: Sub-task
> Components: ML, Tests
> Affects Versions: 2.1.0
> Reporter: Joseph K. Bradley
> Assignee: Seth Hendrickson
>
> Part 1 of parent PR: Add flexibility to the testOutliersWithSmallWeights test.
> See https://github.com/apache/spark/pull/16722 for perspective.
[jira] [Commented] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights
[ https://issues.apache.org/jira/browse/SPARK-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951865#comment-15951865 ]

Apache Spark commented on SPARK-20183:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/17501

> Add outlierRatio option to testOutliersWithSmallWeights
> ---
>
> Key: SPARK-20183
> URL: https://issues.apache.org/jira/browse/SPARK-20183
> Project: Spark
> Issue Type: Sub-task
> Components: ML, Tests
> Affects Versions: 2.1.0
> Reporter: Joseph K. Bradley
> Assignee: Seth Hendrickson
>
> Part 1 of parent PR: Add flexibility to the testOutliersWithSmallWeights test.
> See https://github.com/apache/spark/pull/16722 for perspective.
[jira] [Assigned] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights
[ https://issues.apache.org/jira/browse/SPARK-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20183:
------------------------------------

Assignee: Apache Spark  (was: Seth Hendrickson)

> Add outlierRatio option to testOutliersWithSmallWeights
> ---
>
> Key: SPARK-20183
> URL: https://issues.apache.org/jira/browse/SPARK-20183
> Project: Spark
> Issue Type: Sub-task
> Components: ML, Tests
> Affects Versions: 2.1.0
> Reporter: Joseph K. Bradley
> Assignee: Apache Spark
>
> Part 1 of parent PR: Add flexibility to the testOutliersWithSmallWeights test.
> See https://github.com/apache/spark/pull/16722 for perspective.
[jira] [Created] (SPARK-20183) Add outlierRatio option to testOutliersWithSmallWeights
Joseph K. Bradley created SPARK-20183:
--------------------------------------

Summary: Add outlierRatio option to testOutliersWithSmallWeights
Key: SPARK-20183
URL: https://issues.apache.org/jira/browse/SPARK-20183
Project: Spark
Issue Type: Sub-task
Components: ML, Tests
Affects Versions: 2.1.0
Reporter: Joseph K. Bradley
Assignee: Seth Hendrickson

Part 1 of parent PR: Add flexibility to the testOutliersWithSmallWeights test.
See https://github.com/apache/spark/pull/16722 for perspective.
[jira] [Updated] (SPARK-19591) Add sample weights to decision trees
[ https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-19591:
--------------------------------------

Description: Add sample weights to decision trees. See [SPARK-9478] for details on the design.  (was: Add sample weights to decision trees)

> Add sample weights to decision trees
> ---
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Seth Hendrickson
> Assignee: Seth Hendrickson
>
> Add sample weights to decision trees. See [SPARK-9478] for details on the design.
[jira] [Updated] (SPARK-19591) Add sample weights to decision trees
[ https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-19591:
--------------------------------------

Issue Type: New Feature  (was: Sub-task)
Parent: (was: SPARK-9478)

> Add sample weights to decision trees
> ---
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Seth Hendrickson
> Assignee: Seth Hendrickson
>
> Add sample weights to decision trees
[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Hu updated SPARK-19408:
---------------------------

Description:
In SPARK-17075, we estimate the cardinality of a predicate expression "column (op) literal", where op is =, <, <=, >, or >=. In SQL queries, we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.
In this JIRA, we want to estimate the filter factor of predicate expressions involving two columns of the same table. For example, multiple TPC-H queries have this kind of predicate: "WHERE l_commitdate < l_receiptdate".

> cardinality estimation involving two columns of the same table
> ---
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
> Issue Type: Sub-task
> Components: Optimizer
> Affects Versions: 2.1.0
> Reporter: Ron Hu
>
> In SPARK-17075, we estimate the cardinality of a predicate expression "column (op) literal", where op is =, <, <=, >, or >=. In SQL queries, we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.
> In this JIRA, we want to estimate the filter factor of predicate expressions involving two columns of the same table. For example, multiple TPC-H queries have this kind of predicate: "WHERE l_commitdate < l_receiptdate".
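For intuition, a filter factor for "column-1 < column-2" can be sketched from the per-column min/max statistics. The snippet below is a heuristic illustration, not Spark's actual estimator; it assumes both columns are uniformly and independently distributed over their [min, max] ranges:

```python
# Heuristic sketch (NOT Spark's actual estimator): estimate the filter
# factor of "column-1 < column-2" from min/max column statistics,
# assuming both columns are uniform and independent.
def p_less(min1, max1, min2, max2, steps=10_000):
    if max1 <= min2:   # ranges disjoint: predicate always holds
        return 1.0
    if min1 >= max2:   # ranges disjoint: predicate never holds
        return 0.0
    # Overlapping ranges: integrate P(c2 > x) over the uniform density of
    # c1 with the midpoint rule (assumes max1 > min1 and max2 > min2).
    width1 = max1 - min1
    total = 0.0
    for i in range(steps):
        x = min1 + (i + 0.5) * width1 / steps
        p_c2_greater = min(1.0, max(0.0, (max2 - x) / (max2 - min2)))
        total += p_c2_greater / steps
    return total

# E.g. for l_commitdate < l_receiptdate with identical date ranges,
# this heuristic gives a filter factor of ~0.5.
```

If only the disjoint/overlap information is trusted, a cruder variant returns a fixed default (e.g. 1/3) in the overlap case, which is the usual style of filter-factor defaults in optimizers.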
[jira] [Updated] (SPARK-20003) FPGrowthModel setMinConfidence should affect rules generation and transform
[ https://issues.apache.org/jira/browse/SPARK-20003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20003:
--------------------------------------

Target Version/s: 2.2.0

> FPGrowthModel setMinConfidence should affect rules generation and transform
> ---
>
> Key: SPARK-20003
> URL: https://issues.apache.org/jira/browse/SPARK-20003
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Assignee: yuhao yang
> Priority: Minor
>
> I was doing some testing and found this issue: FPGrowthModel setMinConfidence
> should affect rules generation and transform.
> Currently, associationRules in FPGrowthModel is a lazy val, and
> setMinConfidence in FPGrowthModel has no impact once associationRules has been
> computed.
[jira] [Updated] (SPARK-20003) FPGrowthModel setMinConfidence should affect rules generation and transform
[ https://issues.apache.org/jira/browse/SPARK-20003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20003:
--------------------------------------

Shepherd: Joseph K. Bradley

> FPGrowthModel setMinConfidence should affect rules generation and transform
> ---
>
> Key: SPARK-20003
> URL: https://issues.apache.org/jira/browse/SPARK-20003
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Assignee: yuhao yang
> Priority: Minor
>
> I was doing some testing and found this issue: FPGrowthModel setMinConfidence
> should affect rules generation and transform.
> Currently, associationRules in FPGrowthModel is a lazy val, and
> setMinConfidence in FPGrowthModel has no impact once associationRules has been
> computed.
[jira] [Assigned] (SPARK-20003) FPGrowthModel setMinConfidence should affect rules generation and transform
[ https://issues.apache.org/jira/browse/SPARK-20003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-20003:
-----------------------------------------

Assignee: yuhao yang

> FPGrowthModel setMinConfidence should affect rules generation and transform
> ---
>
> Key: SPARK-20003
> URL: https://issues.apache.org/jira/browse/SPARK-20003
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Assignee: yuhao yang
> Priority: Minor
>
> I was doing some testing and found this issue: FPGrowthModel setMinConfidence
> should affect rules generation and transform.
> Currently, associationRules in FPGrowthModel is a lazy val, and
> setMinConfidence in FPGrowthModel has no impact once associationRules has been
> computed.
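The root-cause pattern described above, a lazily computed value that caches its result and thereafter ignores setter calls, can be illustrated with a minimal Python analogue (class, field, and confidence values here are hypothetical, not the actual FPGrowthModel API):

```python
# Minimal analogue of the reported bug: a lazily computed, cached value
# (like Scala's `lazy val associationRules`) ignores later setter changes.
# Class shape and numbers are hypothetical, for illustration only.
class Model:
    def __init__(self, min_confidence):
        self.min_confidence = min_confidence
        self._rules = None  # cache, playing the role of a Scala `lazy val`

    @property
    def association_rules(self):
        if self._rules is None:  # computed once, then frozen
            candidates = [0.5, 0.7, 0.9, 0.95]  # stand-in rule confidences
            self._rules = [c for c in candidates if c >= self.min_confidence]
        return self._rules

m = Model(min_confidence=0.8)
print(m.association_rules)  # filtered with threshold 0.8
m.min_confidence = 0.6      # too late: the cached rules are reused
print(m.association_rules)  # unchanged, which is the behavior reported here
```

The fix direction implied by the ticket is to make the rules (and transform) respect the current threshold instead of a once-computed cache.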
[jira] [Created] (SPARK-20182) Dot in DataFrame Column title causes errors
Evan Zamir created SPARK-20182:
-------------------------------

Summary: Dot in DataFrame Column title causes errors
Key: SPARK-20182
URL: https://issues.apache.org/jira/browse/SPARK-20182
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.1.0
Reporter: Evan Zamir

I did a search and saw this issue pop up before, and while it seemed like it had been solved before 2.1, I'm still seeing an error.

```
emp = spark.createDataFrame([(["Joe", "Bob", "Mary"],), (["Mike", "Matt", "Stacy"],)], ["first.names"])
print(emp.collect())
emp.select(['first.names']).alias('first')
```

[Row(first.names=['Joe', 'Bob', 'Mary']), Row(first.names=['Mike', 'Matt', 'Stacy'])]

Py4JJavaError Traceback (most recent call last)
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62     try:
---> 63         return f(*a, **kw)
     64     except py4j.protocol.Py4JJavaError as e:

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                 "An error occurred while calling {0}{1}{2}.\n".
--> 319                 format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o1734.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`first.names`' given input columns: [first.names];;
'Project ['first.names]
+- LogicalRDD [first.names#466]
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
	at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
	at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at
[jira] [Updated] (SPARK-20164) AnalysisException not tolerant of null query plan
[ https://issues.apache.org/jira/browse/SPARK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khamar updated SPARK-20164:
---------------------------------

Description:
The query plan in an AnalysisException may be null when an AnalysisException object is serialized and then deserialized, since plan is marked @transient, or when someone throws an AnalysisException with a null query plan (which should not happen).
def getMessage is not tolerant of this and throws a NullPointerException, leading to loss of information about the original exception.
The fix is to add a null check in getMessage.

was:
The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `@transient`, or when someone throws an `AnalysisException` with a null query plan (which should not happen).
`def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
The fix is to add a `null` check in `getMessage`.

> AnalysisException not tolerant of null query plan
> ---
>
> Key: SPARK-20164
> URL: https://issues.apache.org/jira/browse/SPARK-20164
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kunal Khamar
> Assignee: Kunal Khamar
> Fix For: 2.2.0, 2.1.2
>
> The query plan in an AnalysisException may be null when an AnalysisException object is serialized and then deserialized, since plan is marked @transient, or when someone throws an AnalysisException with a null query plan (which should not happen).
> def getMessage is not tolerant of this and throws a NullPointerException, leading to loss of information about the original exception.
> The fix is to add a null check in getMessage.
[jira] [Updated] (SPARK-20164) AnalysisException not tolerant of null query plan
[ https://issues.apache.org/jira/browse/SPARK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khamar updated SPARK-20164:
---------------------------------

Description:
The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `@transient`, or when someone throws an `AnalysisException` with a null query plan (which should not happen).
`def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
The fix is to add a `null` check in `getMessage`.

was: When someone throws an AnalysisException with a null query plan (which ideally no one should), getMessage is not tolerant of this and throws a null pointer exception, leading to loss of information about the original exception.

> AnalysisException not tolerant of null query plan
> ---
>
> Key: SPARK-20164
> URL: https://issues.apache.org/jira/browse/SPARK-20164
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kunal Khamar
> Assignee: Kunal Khamar
> Fix For: 2.2.0, 2.1.2
>
> The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `@transient`, or when someone throws an `AnalysisException` with a null query plan (which should not happen).
> `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
> The fix is to add a `null` check in `getMessage`.
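The described fix, guarding getMessage against a missing plan, can be sketched in a minimal Python analogue (the class below is hypothetical and stands in for the Scala implementation; it is not Spark's actual code):

```python
# Python analogue of the fix described above: the message builder must
# tolerate a plan that is None (e.g. dropped because the field is
# transient during serialization). Hypothetical class, not Spark's code.
class AnalysisException(Exception):
    def __init__(self, message, plan=None):
        super().__init__(message)
        self.message = message
        self.plan = plan  # may be None after a serialization round-trip

    def get_message(self):
        if self.plan is None:  # the added null check
            return self.message
        return f"{self.message};\n{self.plan}"

print(AnalysisException("cannot resolve 'x'").get_message())  # no crash
```

Without the `None` guard, formatting the plan would raise here too, hiding the original error, which mirrors the NullPointerException described in the ticket.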
[jira] [Closed] (SPARK-20163) Kill all running tasks in a stage in case of fetch failure
[ https://issues.apache.org/jira/browse/SPARK-20163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sital Kedia closed SPARK-20163.
-------------------------------

Resolution: Duplicate

> Kill all running tasks in a stage in case of fetch failure
> ---
>
> Key: SPARK-20163
> URL: https://issues.apache.org/jira/browse/SPARK-20163
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.0.1
> Reporter: Sital Kedia
>
> Currently, the scheduler does not kill the running tasks in a stage when it
> encounters a fetch failure; as a result, we might end up running many duplicate
> tasks in the cluster. There is already a TODO in TaskSetManager to kill all
> running tasks, which has not been implemented.
[jira] [Commented] (SPARK-20163) Kill all running tasks in a stage in case of fetch failure
[ https://issues.apache.org/jira/browse/SPARK-20163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951738#comment-15951738 ]

Sital Kedia commented on SPARK-20163:
-------------------------------------

Thanks [~imranr], closing this as a duplicate of SPARK-2666.

> Kill all running tasks in a stage in case of fetch failure
> ---
>
> Key: SPARK-20163
> URL: https://issues.apache.org/jira/browse/SPARK-20163
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.0.1
> Reporter: Sital Kedia
>
> Currently, the scheduler does not kill the running tasks in a stage when it
> encounters a fetch failure; as a result, we might end up running many duplicate
> tasks in the cluster. There is already a TODO in TaskSetManager to kill all
> running tasks, which has not been implemented.
[jira] [Assigned] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port
[ https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20181: Assignee: (was: Apache Spark) > Avoid noisy Jetty WARN log when failing to bind a port > -- > > Key: SPARK-20181 > URL: https://issues.apache.org/jira/browse/SPARK-20181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Derek Dagit >Priority: Minor > > As a user, I would like to suppress the Jetty WARN log about failing to bind > to a port already in use, so that my logs are less noisy. > Currently, Jetty code prints the stack trace of the BindException at WARN > level. In the context of starting a service on an ephemeral port, this is not > a useful warning, and it is exceedingly verbose. > {noformat} > 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED > ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: > Address already in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321) > at > org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > at > org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.spark_project.jetty.server.Server.doStart(Server.java:366) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306) > at 
org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:139) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:448) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) >
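Until a code change like the one proposed lands, one user-side way to quiet this particular warning is to raise the threshold of the shaded Jetty logger. This is an assumption based on the package name visible in the trace above (org.spark_project.jetty) and a standard log4j.properties setup, not a documented Spark setting:

```properties
# Hypothetical conf/log4j.properties tweak: silence WARN-level bind noise
# from the Jetty classes shaded into Spark (package name per the trace above).
log4j.logger.org.spark_project.jetty=ERROR
```

Note this suppresses all WARN output from the shaded Jetty, not just the BindException spam, which is why an in-code fix is preferable.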
[jira] [Assigned] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port
[ https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20181:

Assignee: Apache Spark

> Avoid noisy Jetty WARN log when failing to bind a port
> ------------------------------------------------------
>
> Key: SPARK-20181
> URL: https://issues.apache.org/jira/browse/SPARK-20181
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Derek Dagit
> Assignee: Apache Spark
> Priority: Minor
>
> As a user, I would like to suppress the Jetty WARN log about failing to bind to a port already in use, so that my logs are less noisy.
> Currently, Jetty code prints the stack trace of the BindException at WARN level. In the context of starting a service on an ephemeral port, this is not a useful warning, and it is exceedingly verbose.
> {noformat}
> 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: Address already in use
> java.net.BindException: Address already in use
>         at sun.nio.ch.Net.bind0(Native Method)
>         at sun.nio.ch.Net.bind(Net.java:433)
>         at sun.nio.ch.Net.bind(Net.java:425)
>         at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>         at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>         at org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
>         at org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>         at org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
>         at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>         at org.spark_project.jetty.server.Server.doStart(Server.java:366)
>         at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>         at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
>         at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>         at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>         at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>         at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
>         at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
>         at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
>         at org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>         at org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>         at scala.Option.foreach(Option.scala:257)
>         at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
>         at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
>         at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>         at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>         at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>         at $line3.$read$$iw$$iw.<init>(<console>:15)
>         at $line3.$read$$iw.<init>(<console>:31)
>         at $line3.$read.<init>(<console>:33)
>         at $line3.$read$.<init>(<console>:37)
>         at $line3.$read$.<clinit>(<console>)
>         at $line3.$eval$.$print$lzycompute(<console>:7)
>         at $line3.$eval$.$print(<console>:6)
>         at $line3.$eval.$print(<console>)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>         at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>         at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>         at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>         at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>         at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>         at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>         at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>         at
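Until a fix lands in Spark itself, the warning can be quieted from the user side with logger overrides in log4j.properties (an illustrative fragment, not the change proposed in this issue; Spark's bundled log4j.properties.template already carries a similar `AbstractLifeCycle` line, targeting the shaded `org.spark_project.jetty` package):

```properties
# Illustrative log4j.properties overrides: raise the shaded Jetty loggers'
# thresholds so the BindException WARN (and its stack trace) is suppressed.
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
```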
[jira] [Commented] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port
[ https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951699#comment-15951699 ] Apache Spark commented on SPARK-20181:

User 'd2r' has created a pull request for this issue: https://github.com/apache/spark/pull/17500
[jira] [Commented] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port
[ https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951686#comment-15951686 ] Derek Dagit commented on SPARK-20181:

Working on this...
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951685#comment-15951685 ] Joseph K. Bradley commented on SPARK-9478:

[~clamus] The current vote is to *not use* weights during sampling and then to *use* weights when growing the trees. That will simplify the sampling process, so we hopefully won't have to deal with the complexity you're mentioning. Note that we'll have to weight the trees in the forest to make this approach work. I'm also guessing that it will give better-calibrated probability estimates in the final forest, though this is based on intuition rather than analysis. E.g., given the 4-instance dataset in [~sethah]'s example above, with subsampling 4 instances for each tree, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees. We pick 10 sets of 4 rows. The probability of always picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... (current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5. This means we'll almost surely have at least one tree with the weight-1000 row, so it will dominate our predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.98. This means it's pretty likely we'll have some tree predicting label1, so we'll keep our probability predictions away from 0 and 1.
This is really hand-wavy, but it does alleviate my fears of having extreme log losses. On the other hand, maybe it could be handled by adding smoothing to predictions...
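The back-of-the-envelope numbers in the comment above can be reproduced with a short script (a sketch under the stated assumptions: 4 rows, one of weight 1000 and three of weight 1, with 4 draws with replacement per tree and 10 trees):

```python
# Sanity-check the sampling probabilities from the comment above.
n_trees, draws = 10, 4
w_heavy, w_light, n_light = 1000.0, 1.0, 3

# Weighted sampling: chance that a single draw hits the weight-1000 row,
# then that every draw of every tree does.
p_heavy = w_heavy / (w_heavy + n_light * w_light)
p_all_heavy = p_heavy ** (n_trees * draws)
print(round(p_all_heavy, 2))        # ~0.89

# Unweighted sampling: chance that all 40 draws land on weight-1 rows.
p_all_light = (n_light / 4) ** (n_trees * draws)
print(p_all_light)                  # ~1e-05

# Unweighted sampling: chance that at least one of the 10 trees
# sees only weight-1 rows.
p_tree_light = (n_light / 4) ** draws
p_some_light_tree = 1 - (1 - p_tree_light) ** n_trees
print(round(p_some_light_tree, 2))  # ~0.98
```

Note the last figure: the chance that *some* tree contains only weight-1 rows is high (~0.98), which is what keeps the forest's probability predictions away from 0 and 1.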
> Add sample weights to Random Forest
> -----------------------------------
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.1
> Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample (instance) weights. Weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). Sample weights generalize class weights, so this could be used to add class weights later on.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port
Derek Dagit created SPARK-20181:
--------------------------------

Summary: Avoid noisy Jetty WARN log when failing to bind a port
Key: SPARK-20181
URL: https://issues.apache.org/jira/browse/SPARK-20181
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.1.0
Reporter: Derek Dagit
Priority: Minor
[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9478:

Description: Currently, this implementation of random forest does not support sample (instance) weights. Weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). Sample weights generalize class weights, so this could be used to add class weights later on.
(was: Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold).)
[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9478:

Shepherd: Joseph K. Bradley
[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9478:

Shepherd: (was: Joseph K. Bradley)
[jira] [Commented] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern
[ https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951656#comment-15951656 ] Sahil Takiar commented on SPARK-20161:

[~xuefuz] could you comment on https://github.com/apache/spark/pull/17499 and maybe provide some more context as to how this will benefit HoS?

> Default log4j properties file should print thread-id in ConversionPattern
> -------------------------------------------------------------------------
>
> Key: SPARK-20161
> URL: https://issues.apache.org/jira/browse/SPARK-20161
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, YARN
> Affects Versions: 2.1.0
> Reporter: Sahil Takiar
>
> The default log4j file in {{spark/conf/log4j.properties.template}} doesn't display the thread-id when printing out the logs. It would be very useful to add this, especially for YARN. Currently, logs from all the different threads in a single executor are sent to the same log file. This makes debugging difficult as it is hard to filter out what logs come from what thread.
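For reference, adding the thread name to the pattern is a one-line change in log4j 1.x, where `%t` is the conversion character for the thread name (an illustrative snippet based on the template's default pattern; the exact pattern adopted by the PR may differ):

```properties
# Illustrative log4j.properties fragment: %t inserts the thread name
# between the timestamp and the log level.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %t %p %c{1}: %m%n
```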
[jira] [Assigned] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern
[ https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20161:

Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern
[ https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951595#comment-15951595 ] Apache Spark commented on SPARK-20161:

User 'sahilTakiar' has created a pull request for this issue: https://github.com/apache/spark/pull/17499
[jira] [Assigned] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern
[ https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20161:

Assignee: Apache Spark
[jira] [Updated] (SPARK-20161) Default log4j properties file should print thread-id in ConversionPattern
[ https://issues.apache.org/jira/browse/SPARK-20161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sahil Takiar updated SPARK-20161:

Summary: Default log4j properties file should print thread-id in ConversionPattern (was: Default spark/conf/log4j.properties.template should print thread-id in ConversionPattern)
[jira] [Resolved] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20179. --- Resolution: Duplicate > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Original Estimate: 0h > Remaining Estimate: 0h > > The code I would like to push allows major performance improvements for single-item patterns through the use of a CP-based solver (namely OscaR, an open-source solver), and slightly improves performance for multi-item patterns. > As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same and at best up to 50x faster (FIFA dataset). > Link to graph : http://i67.tinypic.com/t06lw7.jpg > Link to implementation : https://github.com/Syrux/spark > Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9) > Performance was tested on the CECI servers, providing the driver with 10G memory (more than needed) and running 4 slaves. > In addition to the performance improvements, I also added a bunch of new functionalities: > - Unlimited max pattern length (with input 0): no improvement in performance here, this simply allows the use of an unlimited max pattern length. > - Min pattern length: any pattern below that length won't be output. No improvement in performance, just a new functionality. > - Max items per itemset: an itemset won't be grown further than the input number, thus reducing the search space. > - Head start: during the initial dataset cleaning, the frequent items were found and then discarded, which resulted in an inefficient first iteration of the genFreqPattern method. The algorithm now uses them if they are provided, and uses the empty pattern in case they're not. A slight improvement in performance was found. > - Sub-problem limit: when the resulting item sequences can be very long and the user disposes of a small number of very powerful machines, this parameter allows a quick switch to local execution, tremendously improving performance. Outside of those conditions, performance may be negatively affected. > - Item constraints: allow the user to specify constraints on the occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for some specific results, the search space can be greatly reduced, which also improves performance. > Of course, all of these added features were tested for correctness, as you can see on the GitHub link. > Please note that the aforementioned functionalities didn't come into play when testing the performance. The performance shown on the graph is 'merely' the result of the replacement of the local execution by a CP-based algorithm. (maxLocalProjDBSize was also kept at its default value (3200L), to make sure the performance couldn't be artificially improved.) > Trade-offs : > - The algorithm does have a slight trade-off: since it needs to detect whether sequences can use the specialized single-item-pattern CP algorithm, it may be a bit slower when that algorithm is not needed. This trade-off was mitigated by introducing a round of sequence cleaning before the local execution, improving the performance of multi-item local executions when the cleaning is effective. In case no item can be cleaned, the check will show up as a slight drop in performance. > - All other changes provided shouldn't have any effect on efficiency or complexity if left at their default values (where they are basically deactivated). When activated, they may however reduce the search space and thus improve performance. > Additional notes : > - The performance improvements are mostly seen for datasets where all itemsets are of size one, since that is a necessary condition for the use of the CP-based algorithm. But as you can see in the two slen datasets, performance was also slightly improved for datasets which have multiple items per itemset. The algorithm was built to detect when CP can be used, and to use it, given the opportunity. > - The performance displayed here is the result of six months of work. Various other things were tried to improve performance, without as much success. I can thus say with a bit of confidence that the performance attained here will be very hard to improve further. > - In case you want me to test the performance on a specific dataset or to provide additional
[jira] [Updated] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20109: -- Priority: Minor (was: Major) > Need a way to convert from IndexedRowMatrix to Dense Block Matrices > --- > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello >Priority: Minor > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want. Users need an option to convert from IndexedRowMatrix to > BlockMatrix that backs the BlockMatrix with local instances of DenseMatrix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
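The conversion the reporter asks for — carving a row-indexed matrix directly into dense blocks instead of round-tripping through a sparse coordinate form — can be sketched in plain Python. This is an illustration of the blocking logic only (no Spark, names are mine), and it zero-pads edge blocks rather than reproducing any particular BlockMatrix edge semantics:

```python
def to_dense_blocks(rows, n_cols, rows_per_block, cols_per_block):
    """Group (row_index, row_values) pairs directly into dense blocks.

    Returns a dict mapping (block_row, block_col) -> 2D list of floats.
    Cells not covered by an input row stay 0.0, mimicking a dense-backed
    block matrix instead of a sparse-backed one.
    """
    blocks = {}
    n_block_cols = (n_cols + cols_per_block - 1) // cols_per_block
    for i, values in rows:
        bi, offset_i = divmod(i, rows_per_block)   # which block row, and where in it
        for bj in range(n_block_cols):
            block = blocks.setdefault(
                (bi, bj),
                [[0.0] * cols_per_block for _ in range(rows_per_block)])
            for j in range(cols_per_block):
                col = bj * cols_per_block + j
                if col < n_cols:
                    block[offset_i][j] = values[col]
    return blocks
```

The point of the sketch: each row lands in its dense block in one pass, with no intermediate (i, j, value) coordinate representation.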
[jira] [Updated] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20180: --- Description: Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. The current default maxPatternLength value is 10. This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all patterns in his dataset without looking at this parameter. was: Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. The current default maxPatternLength value is 10. This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all the patterns in his dataset without looking at this parameter. > Unlimited max pattern length in Prefix span > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Reporter: Cyril de Vogelaere > Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all patterns in his dataset without looking at this parameter. 
[jira] [Updated] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20180: --- Description: Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. The current default maxPatternLength value is 10. This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all the patterns in his dataset without looking at this parameter. was: Right now, we need to use .setMaxPatternLength(x) (with x > 0) to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. The current default maxPatternLength value is 10. This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all the patterns in his dataset without looking at this parameter. > Unlimited max pattern length in Prefix span > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Reporter: Cyril de Vogelaere > Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all the patterns in his dataset without looking at this parameter. 
[jira] [Created] (SPARK-20180) Unlimited max pattern length in Prefix span
Cyril de Vogelaere created SPARK-20180: -- Summary: Unlimited max pattern length in Prefix span Key: SPARK-20180 URL: https://issues.apache.org/jira/browse/SPARK-20180 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Reporter: Cyril de Vogelaere Priority: Minor Right now, we need to use .setMaxPatternLength(x) (with x > 0) to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output. The current default maxPatternLength value is 10. This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all the patterns in his dataset without looking at this parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
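The proposed semantics — 0 as a sentinel meaning "no upper limit" — can be sketched as a simple length check. This is an illustration of the requested behavior in plain Python, not the actual MLlib code:

```python
def pattern_length_ok(pattern, max_pattern_length=0, min_pattern_length=0):
    """Return True if `pattern` (a list of itemsets) falls within the limits.

    A max_pattern_length of 0 is the proposed 'unlimited' sentinel:
    no upper bound is applied, so patterns of any length pass.
    """
    length = sum(len(itemset) for itemset in pattern)
    if length < min_pattern_length:
        return False
    return max_pattern_length == 0 or length <= max_pattern_length
```

With the default changed to 0 as proposed, a new user would get every frequent pattern without ever touching this parameter.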
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951473#comment-15951473 ] Kazuaki Ishizaki commented on SPARK-20176: -- Could you please post the program that can reproduce this issue? > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = 
i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */ boolean isNull15 = false; > /* 097 */ double value15 = -1.0; > /* 098 */ if (!isNull16 && value16) { > /* 099 */ > /* 100 */
[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951438#comment-15951438 ] Cyril de Vogelaere commented on SPARK-20179: Hello Joseph, Thanks for your very helpful comment. I will start by treating each additional functionality separately, to get familiarized with the process. I will also explain in depth what it could bring to the user, and why I judge it important, before finishing with the main part of the code and the CP implementation. Is it ok if I keep adding you as shepherd? > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Original Estimate: 0h > Remaining Estimate: 0h
[jira] [Resolved] (SPARK-20165) Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec
[ https://issues.apache.org/jira/browse/SPARK-20165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20165. --- Resolution: Fixed Issue resolved by pull request 17488 [https://github.com/apache/spark/pull/17488] > Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec > > > Key: SPARK-20165 > URL: https://issues.apache.org/jira/browse/SPARK-20165 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.2.0 > > > Encoder's deserializer must be resolved at the driver where the class is > defined. Otherwise there are corner cases using nested classes where > resolving at the executor can fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20160) Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-20160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20160. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17484 [https://github.com/apache/spark/pull/17484] > Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog > > > Key: SPARK-20160 > URL: https://issues.apache.org/jira/browse/SPARK-20160 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > {{ParquetConversions}} and {{OrcConversions}} should be treated as regular > Analyzer rules. It is not reasonable to be part of {{HiveSessionCatalog}}. > After moving these two rules out of {{HiveSessionCatalog}}, the next step is > to rename {{HiveMetastoreCatalog}} because it is not related to the hive > package any more. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951269#comment-15951269 ] Joseph K. Bradley commented on SPARK-20179: --- Thanks for the thoughts & work. Sean's right that these practices are described in the contributing guide, as well as a lot of other helpful info. I'd recommend a few things: * Split the proposals up into smaller pieces. Putting everything into 1 JIRA and/or PR makes it hard for reviewers to understand what is being proposed and how the changes interact. * Make JIRA titles and descriptions very clear in terms of what the key change is. If it's multiple changes, can these be broken into separate parts and added incrementally? If the changes are related, it can be OK to create an umbrella JIRA which gives a holistic view; you can put the actual changes and PRs under subtasks. * Start with the smallest incremental changes you're interested in to get familiar with the contribution process. * Keep the perspective of reviewers in mind: If the code is long or complex to describe, it's going to be overwhelming to reviewers who have never seen it. Thanks! > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Original Estimate: 0h > Remaining Estimate: 0h
[jira] [Resolved] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files
[ https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20084. Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.1.2 2.2.0 > Remove internal.metrics.updatedBlockStatuses accumulator from history files > --- > > Key: SPARK-20084 > URL: https://issues.apache.org/jira/browse/SPARK-20084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.2.0, 2.1.2 > > > History files for large jobs can be hundreds of GB. These history files take > too much space and create a backlog on the history server. > Most of the size is from Accumulables in SparkListenerTaskEnd. The largest > accumulable is internal.metrics.updatedBlockStatuses, which has a small > update (the blocks that were changed) but a huge value (all known blocks). > Nothing currently uses the accumulator value or update, so it is safe to > remove it. Information for any block updated during a task is also recorded > under Task Metrics / Updated Blocks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20179: --- Description: The code I would like to push allows major performance improvements for single-item patterns through the use of a CP-based solver (namely OscaR, an open-source solver), and slightly improves performance for multi-item patterns. As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same and at best up to 50x faster (FIFA dataset). Link to graph : http://i67.tinypic.com/t06lw7.jpg Link to implementation : https://github.com/Syrux/spark Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9) Performance was tested on the CECI servers, providing the driver with 10G memory (more than needed) and running 4 slaves. In addition to the performance improvements, I also added a bunch of new functionalities: - Unlimited max pattern length (with input 0): no improvement in performance here, this simply allows the use of an unlimited max pattern length. - Min pattern length: any pattern below that length won't be output. No improvement in performance, just a new functionality. - Max items per itemset: an itemset won't be grown further than the input number, thus reducing the search space. - Head start: during the initial dataset cleaning, the frequent items were found and then discarded, which resulted in an inefficient first iteration of the genFreqPattern method. The algorithm now uses them if they are provided, and uses the empty pattern in case they're not. A slight improvement in performance was found. - Sub-problem limit: when the resulting item sequences can be very long and the user disposes of a small number of very powerful machines, this parameter allows a quick switch to local execution, tremendously improving performance. Outside of those conditions, performance may be negatively affected. - Item constraints: allow the user to specify constraints on the occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for some specific results, the search space can be greatly reduced, which also improves performance. Of course, all of these added features were tested for correctness, as you can see on the GitHub link. Please note that the aforementioned functionalities didn't come into play when testing the performance. The performance shown on the graph is 'merely' the result of the replacement of the local execution by a CP-based algorithm. (maxLocalProjDBSize was also kept at its default value (3200L), to make sure the performance couldn't be artificially improved.) Trade-offs : - The algorithm does have a slight trade-off: since it needs to detect whether sequences can use the specialized single-item-pattern CP algorithm, it may be a bit slower when that algorithm is not needed. This trade-off was mitigated by introducing a round of sequence cleaning before the local execution, improving the performance of multi-item local executions when the cleaning is effective. In case no item can be cleaned, the check will show up as a slight drop in performance. - All other changes provided shouldn't have any effect on efficiency or complexity if left at their default values (where they are basically deactivated). When activated, they may however reduce the search space and thus improve performance. Additional notes : - The performance improvements are mostly seen for datasets where all itemsets are of size one, since that is a necessary condition for the use of the CP-based algorithm. But as you can see in the two slen datasets, performance was also slightly improved for datasets which have multiple items per itemset. The algorithm was built to detect when CP can be used, and to use it, given the opportunity. - The performance displayed here is the result of six months of work. Various other things were tried to improve performance, without as much success. I can thus say with a bit of confidence that the performance attained here will be very hard to improve further. - In case you want me to test the performance on a specific dataset or to provide additional information, it would be my pleasure to do so :). Just hit me up at my email below : Email : cyril.devogela...@gmail.com - I am a newbie contributor to Spark, and am not familiar with the whole procedure at all. In case I did something incorrectly, I will fix it as soon as possible. was: The code I would like to push allows major performance improvements for single-item patterns through the use of a CP-based solver (namely OscaR, an open-source solver), and slightly improves performance for multi-item patterns. As you can see in the log
[jira] [Resolved] (SPARK-20164) AnalysisException not tolerant of null query plan
[ https://issues.apache.org/jira/browse/SPARK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20164. - Resolution: Fixed Assignee: Kunal Khamar Fix Version/s: 2.2.0 2.1.2 > AnalysisException not tolerant of null query plan > - > > Key: SPARK-20164 > URL: https://issues.apache.org/jira/browse/SPARK-20164 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Kunal Khamar > Assignee: Kunal Khamar > Fix For: 2.1.2, 2.2.0 > > > When someone throws an AnalysisException with a null query plan (which ideally no one should), getMessage is not tolerant of this and throws a null pointer exception, leading to loss of information about the original exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
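The direction of the fix — guarding the message formatter against a missing plan instead of dereferencing it — can be sketched language-neutrally. The class below is an illustrative Python stand-in, not Spark's actual Scala AnalysisException:

```python
class AnalysisError(Exception):
    """Illustrative exception that tolerates a null/None query plan."""

    def __init__(self, message, plan=None):
        super().__init__(message)
        self.message = message
        self.plan = plan  # may legitimately be None

    def get_message(self):
        # Guard against a missing plan so the original message is never
        # lost to a secondary NullPointerException-style failure.
        if self.plan is None:
            return self.message
        return f"{self.message};\n{self.plan}"
```

The key point is that the null check lives in the accessor, so callers constructing the exception carelessly still get a useful message.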
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951202#comment-15951202 ] Li Jin commented on SPARK-20144: Thanks Sean! I appreciate your time and help very much. > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2 > Reporter: Li Jin > > Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was produced with. > This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
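A common workaround for the behavior described is to persist an explicit index column and sort on it after reading, rather than relying on file/partition order. A minimal plain-Python sketch of the idea (no Spark; names are illustrative):

```python
def write_with_index(rows):
    # Attach an explicit ordering key before writing, since the reader
    # may later combine and reorder partitions arbitrarily.
    return [(i, row) for i, row in enumerate(rows)]

def read_restoring_order(indexed_rows):
    # Simulate a reader returning rows in arbitrary partition order,
    # then restore the original order by sorting on the persisted index.
    return [row for _, row in sorted(indexed_rows)]
```

The same pattern applies in Spark itself: carry an ordering column through the write, and sort on it after the read instead of assuming row order is preserved.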
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951194#comment-15951194 ] Imran Rashid commented on SPARK-20178: -- Thanks for writing this up, Tom. The only way I see to have a pluggable interface in the current code is to abstract out the *entire* thing -- DAGScheduler, TSM, TSI, perhaps also CGSB and OCC. That would be pretty extreme, though; I'd only consider it if we actually have some reason to think we'd come up with a better version (e.g. new abstractions with less shared state). In addition to not destabilizing the current scheduler, we should also think of what the migration path would be for enabling these new changes. Will there be a way for Spark to auto-tune? Or will we need to create a number of new confs? I know everyone hates having a huge set of configuration that needs to be tuned, but at some point I think it's OK if Spark works reasonably well on small clusters by default, and for large clusters you've just got to have somebody who knows how to configure it carefully. Another thing to keep in mind is that Spark is used on a huge variety of workloads. I feel like right now we're very focused on large jobs on big clusters with long tasks; but Spark is also used with very small tasks, especially streaming. I think all the ideas we're considering only affect behavior after there is a failure, so hopefully it wouldn't matter. But we need to be careful that we don't introduce complexity which affects performance even before any failures. > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are 4 JIRAs currently related to this.
> We should try to get a list of things we want to improve and come up with one > cohesive design. > SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753 > I will put my initial thoughts in a follow-on comment.
[jira] [Issue Comment Deleted] (SPARK-20156) Locale dependent library used for upper and lowercase conversions.
[ https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Serkan Taş updated SPARK-20156: --- Comment: was deleted (was: console log before setting locale) > Locale dependent library used for upper and lowercase conversions. > - > > Key: SPARK-20156 > URL: https://issues.apache.org/jira/browse/SPARK-20156 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0 > Environment: Ubuntu 16.04 > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121) >Reporter: Serkan Taş > Attachments: sprk_shell.txt > > > If the regional setting of the operating system is Turkish, the famous Java > locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or > https://issues.apache.org/jira/browse/AVRO-1493). > e.g : > "SERDEINFO" lowers to "serdeınfo" > "uniquetable" uppers to "UNİQUETABLE" > workaround : > add -Duser.country=US -Duser.language=en to the end of the line > SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true" > in spark-shell.sh
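The Turkish-locale trap in this report is reproducible with the JDK's locale-sensitive case conversions alone; a minimal sketch (helper names are mine, not from Spark):

```java
import java.util.Locale;

class LocaleCase {
    // String.toLowerCase/toUpperCase are locale-sensitive. Under Turkish
    // rules, 'I' lowers to dotless 'ı' and 'i' uppers to dotted 'İ', which
    // breaks code that compares ASCII identifiers after case conversion.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }
}
```

Passing `Locale.ROOT` (or forcing `-Duser.country=US -Duser.language=en` as in the workaround above) yields the plain ASCII result regardless of the host's regional settings.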
[jira] [Commented] (SPARK-20163) Kill all running tasks in a stage in case of fetch failure
[ https://issues.apache.org/jira/browse/SPARK-20163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951136#comment-15951136 ] Imran Rashid commented on SPARK-20163: -- I think this is a duplicate of SPARK-2666, which has more discussion in it. Unless there is something which makes this distinct, can we close this as a duplicate? > Kill all running tasks in a stage in case of fetch failure > -- > > Key: SPARK-20163 > URL: https://issues.apache.org/jira/browse/SPARK-20163 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 >Reporter: Sital Kedia > > Currently, the scheduler does not kill the running tasks in a stage when it > encounters a fetch failure; as a result, we might end up running many duplicate > tasks in the cluster. There is already a TODO in TaskSetManager to kill all > running tasks, which has not yet been implemented.
[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20179: --- Description: The code I would like to push allows a major performance improvement for single-item patterns through the use of a CP-based solver (namely OscaR, an open-source solver), and a slight performance improvement for multi-item patterns. As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same, and at best up to 50x faster (FIFA dataset).
Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9)
Performance was tested on the CECI servers, providing the driver with 10G of memory (more than needed) and running 4 slaves.
In addition to the performance improvements, I also added a number of new features:
- Unlimited max pattern length (with input 0) : no performance improvement here, simply allows the use of an unlimited max pattern length.
- Min pattern length : any pattern below that length won't be output. No improvement in performance, just a new feature.
- Max items per itemset : an itemset won't be grown further than the input number, thus reducing the search space.
- Head start : during the initial dataset cleaning, the frequent items were found and then discarded, which resulted in an inefficient first iteration of the genFreqPattern method. The algorithm now uses them if they are provided, and uses the empty pattern in case they're not. A slight performance improvement was found.
- Sub-problem limit : when the resulting item sequences can be very long and the user has a small number of very powerful machines, this parameter allows a quick switch to local execution, tremendously improving performance. Outside of those conditions, performance may be negatively affected.
- Item constraints : allow the user to specify constraints on the occurrences of an item (=, >, <, >=, <=, !=). For users who are looking for specific results, the search space can be greatly reduced, which also improves performance.
Of course, all of these added features were tested for correctness, as you can see at the GitHub link. Please note that the aforementioned features didn't come into play when testing performance. The performance shown on the graph is 'merely' the result of the replacement of the local execution by a CP-based algorithm. (maxLocalProjDBSize was also kept at its default value (3200L), to make sure the performance couldn't be artificially improved.)
Trade-offs :
- The algorithm does have a slight trade-off: since it needs to detect whether sequences can use the specialised single-item-pattern CP algorithm, it may be a bit slower when that algorithm is not needed. This trade-off was mitigated by introducing a round of sequence cleaning before the local execution, thus improving the performance of multi-item local executions when the cleaning is effective. In case no item can be cleaned, the check shows up as a slight drop in performance.
- All the other changes provided shouldn't have any effect on efficiency or complexity if left at their default values (where they are basically deactivated). When activated, they may however reduce the search space and thus improve performance.
Additional notes :
- The performance improvements are mostly seen for datasets where all itemsets are of size one, since that is a necessary condition for the use of the CP-based algorithm. But as you can see in the two Slen datasets, performance was also slightly improved for datasets which have multiple items per itemset. The algorithm was built to detect when CP can be used, and to use it given the opportunity.
- The performance displayed here is the result of six months of work. Various other things were tried to improve performance, without as much success. I can thus say with some confidence that the performance attained here will be very hard to improve further.
- In case you want me to test performance on a specific dataset or to provide additional information, it would be my pleasure to do so :). Just hit me up at my email below: Email : cyril.devogela...@gmail.com
- I am a newbie contributor to Spark and am not familiar with the whole procedure at all. In case I did something incorrectly, I will fix it as soon as possible.
was: The code I would like to push allows major performances improvement for single-item patterns through the use of a CP based solver (Namely OscaR, an open-source solver). And slight perfomance improved for multi-item patterns. As you can see in the log
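As a rough illustration of the constraint-style knobs listed in the description (min pattern length, max items per itemset), here is a hedged sketch of what such a post-filter could look like in plain Java; the names and semantics are my assumptions, not the proposed API:

```java
import java.util.List;

class PatternFilter {
    // Hypothetical post-filter for a sequential pattern represented as a
    // list of itemsets: enforce a minimum total pattern length and a
    // maximum number of items per itemset, as in the feature list above.
    static boolean accept(List<List<Integer>> pattern,
                          int minPatternLength,
                          int maxItemsPerItemset) {
        int totalItems = pattern.stream().mapToInt(List::size).sum();
        if (totalItems < minPatternLength) {
            return false; // too short: would not be output
        }
        // every itemset must respect the per-itemset cap
        return pattern.stream()
                .allMatch(itemset -> itemset.size() <= maxItemsPerItemset);
    }
}
```

In the actual proposal these constraints prune the search space during mining rather than filtering afterwards, which is where the performance benefit comes from; the sketch only shows the acceptance criteria.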
[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20179: --- Priority: Major (was: Minor) > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 > Environment: All >Reporter: Cyril de Vogelaere > Original Estimate: 0h > Remaining Estimate: 0h >
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951095#comment-15951095 ] Sean Owen commented on SPARK-20144: --- Probably best to wait for an informed opinion, but I would assume for now you need to sort. I'm just saying that theoretically sorted data needs no data movement to become sorted, because it already is. It may not actually even be expensive. > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >
[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20179: --- Description:
The code I would like to push allows a major performance improvement through the use of a CP-based solver (namely OscaR, an open-source solver). As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same and at best up to 50x faster (FIFA dataset).
Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9)
Performance was tested on the CECI servers, providing the driver with 10G of memory (more than needed) and running 4 slaves.
Trade-offs :
- The algorithm does have a slight trade-off: since it needs to detect whether sequences can use the specialised single-item-pattern CP algorithm, it may be a bit slower when that algorithm is not needed. This trade-off was mitigated by introducing a round of sequence cleaning before the local execution, which improves the performance of multi-item local executions when the cleaning is effective. When no item can be cleaned, the check shows up as a slight drop in performance.
- All other changes should have no effect on efficiency or complexity if left at their default values (where they are basically deactivated).
In addition to the performance improvements, I also added a number of new features :
- Unlimited max pattern length (with input 0)
- Min pattern length : any pattern below that length won't be output
- Max items per itemset : an itemset won't be grown beyond the given number of items, thus reducing the search space
- Head start : during the initial dataset cleaning, the frequent items were found and then discarded, which resulted in an inefficient first iteration of the genFreqPattern method. The algorithm now uses them if they are provided, and uses the empty pattern if they're not. A slight performance improvement was observed.
- Sub-problem limit : when the resulting item sequences can be very long and the user has a small number of very powerful machines, this parameter allows a quick switch to local execution, tremendously improving performance. Outside of those conditions, performance may be negatively affected.
- Item constraints : allow the user to specify constraints on the occurrences of an item (=, >, <, >=, <=, !=). For users looking for specific results, the search space can be greatly reduced, which also improves performance.
Please note that the aforementioned features didn't come into play when testing performance. The performance shown on the graph is 'merely' the result of replacing the local execution with a CP-based algorithm. (maxLocalProjDBSize was also kept at its default value (3200L), to make sure performance couldn't be artificially improved.)
Additional notes :
- The performance improvements are mostly seen for datasets where all itemsets are of size one, since that is a necessary condition for using the CP-based algorithm. But as you can see in the two Slen datasets, performance was also slightly improved for datasets with multiple items per itemset. The algorithm was built to detect when CP can be used, and to use it given the opportunity.
- The performance displayed here is the result of six months of work. Various other things were tried to improve performance, without as much success. I can thus say with some confidence that the performance attained here will be very hard to improve further.
- In case you want me to test performance on a specific dataset or to provide additional information, it would be my pleasure to do so :). Just hit me up at my email below :
Email : cyril.devogela...@gmail.com
- I am a newbie contributor to Spark, and am not familiar with the whole procedure at all. In case I did something incorrectly, I will fix it as soon as possible.
was: The code I would like to push allows a major performance improvement through the use of a CP-based solver (namely OscaR, an open-source solver). As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same and at best up to 50x faster (FIFA dataset). Link to graph : http://i67.tinypic.com/t06lw7.jpg Link to implementation : https://github.com/Syrux/spark Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9) Performance was tested on the CECI servers, providing the driver with 10G of memory (more than needed) and
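The item-constraint feature described above can be pictured as a post-filter on mined patterns: count how often the item occurs in a pattern and keep the pattern only if the count satisfies the chosen comparison. A minimal sketch of that idea, in plain Python rather than the actual patch (the function names and the list-of-itemsets pattern representation are illustrative assumptions, not Spark's API):

```python
import operator

# Comparison operators named in the description: =, >, <, >=, <=, !=
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le, "!=": operator.ne}

def satisfies(pattern, item, op, bound):
    """Check whether `item` occurs in `pattern` (a list of itemsets)
    a number of times satisfying the constraint, e.g. (">=", 2)."""
    count = sum(itemset.count(item) for itemset in pattern)
    return OPS[op](count, bound)

def filter_patterns(patterns, item, op, bound):
    """Keep only patterns whose occurrence count of `item` satisfies
    the constraint -- the pruning effect the description mentions."""
    return [p for p in patterns if satisfies(p, item, op, bound)]
```

In the real implementation the constraint would prune the search space during pattern growth rather than after the fact, which is where the performance benefit comes from.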
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951084#comment-15951084 ] Li Jin commented on SPARK-20144: Also, I am not sure about "If the data were sorted, sorting would be pretty cheap". Can you explain more about this? > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2 > Reporter: Li Jin > > Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was produced with. > This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so I'm not sure if this is an issue with 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951077#comment-15951077 ] Cyril de Vogelaere commented on SPARK-20179: Hello Sean, I did have a look at the contributing page, but I don't really understand what you mean exactly. Do you mean I should go over every bit of code I changed? Because that may take a while ^^' As for the performance trade-offs, I will add them to the ticket, starting right now. > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951073#comment-15951073 ] Li Jin commented on SPARK-20144: I totally agree that correctness takes precedence. If sorting is the only way, we will do that, but I think there is a way we can maintain ordering with the parquet format. Parquet itself doesn't change the ordering: data in parquet is stored as parquet_file_0, parquet_file_1, ... and rows are ordered within those files. However, it is FileSourceStrategy (https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L168) that re-sorts the parquet files and ends up changing the ordering. If the expected semantics of Parquet don't maintain order, I won't complain about the behavior of spark.read.parquet, but it seems it's Catalyst that is changing the ordering here. > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2 > Reporter: Li Jin
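Whatever the resolution, the usual defensive pattern when a reader may re-order files or partitions is to stop relying on physical order: persist an explicit ordering column before writing and sort on it after reading (in Spark this would be an added index column plus an orderBy after spark.read.parquet). A sketch of the idea in plain Python, with lists standing in for files and DataFrames (all names here are illustrative, not Spark's API):

```python
def write_with_index(rows):
    """Attach an explicit position to each row before 'writing',
    so the original order survives any re-partitioning on read."""
    return [(i, row) for i, row in enumerate(rows)]

def read_restoring_order(files):
    """Simulate a reader that returns file contents in arbitrary
    order, then restore the original order from the index column."""
    merged = [pair for f in files for pair in f]
    return [row for _, row in sorted(merged, key=lambda p: p[0])]
```

Because the sort key is already monotonically increasing within each file, the final sort is close to a cheap merge of sorted runs, which is presumably what "sorting would be pretty cheap" refers to.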
[jira] [Updated] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20179: -- Shepherd: (was: Joseph K. Bradley) Flags: (was: Important) Target Version/s: (was: 2.1.0) Labels: (was: newbie performance test) Priority: Minor (was: Major) Please start by reading the link I posted then. This is not how changes are proposed. > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h
[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951051#comment-15951051 ] Cyril de Vogelaere commented on SPARK-20179: I forgot to mention: I am ready to push the code anytime now, but I heard that it needed to be reviewed and corrected first. I am not very familiar with the procedure, so it would be helpful if someone could advise me on what to do. > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Labels: newbie, performance, test > Original Estimate: 0h > Remaining Estimate: 0h
[jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951044#comment-15951044 ] Sean Owen commented on SPARK-20179: --- It's not clear what you are proposing _for Spark_. You're describing some modifications you made in your own build, but not what changed, what the complexity or tradeoffs are. Have a look at http://spark.apache.org/contributing.html first please. I think this is indeed a duplicate of SPARK-10678. > Major improvements to Spark's Prefix span > - > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Labels: newbie, performance, test > Original Estimate: 0h > Remaining Estimate: 0h
[jira] [Commented] (SPARK-10678) Specialize PrefixSpan for single-item patterns
[ https://issues.apache.org/jira/browse/SPARK-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951040#comment-15951040 ] Cyril de Vogelaere commented on SPARK-10678: I hadn't seen this issue when I created mine. I have finished an implementation that specialises PrefixSpan for single-item patterns, using a CP solver. Here is the link to the issue I created, which proposes other improvements alongside this particular one: https://issues.apache.org/jira/browse/SPARK-20179 Any advice you have for me would be welcome, since I'm a newbie at contributing to Spark. > Specialize PrefixSpan for single-item patterns > -- > > Key: SPARK-10678 > URL: https://issues.apache.org/jira/browse/SPARK-10678 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 1.6.0 > Reporter: Xiangrui Meng > > We assume the input itemsets are multi-item in PrefixSpan, e.g., (ab)(cd). In some use cases, all itemsets are single-item, e.g., abcd. In this case, our implementation has overhead remembering the boundaries between itemsets. We could detect this and provide a specialized implementation for this use case.
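The specialisation described in SPARK-10678 hinges on first detecting whether every itemset in the input is single-item, so the itemset-boundary bookkeeping can be dropped. A hedged sketch of that check in plain Python (the names and the nested-list representation of sequences are assumptions for illustration, not the MLlib code):

```python
def all_single_item(sequences):
    """Return True when every itemset in every sequence holds exactly
    one item, i.e. the boundary-tracking overhead is unnecessary."""
    return all(len(itemset) == 1
               for seq in sequences
               for itemset in seq)

def choose_path(sequences):
    """Dispatch to the specialised algorithm only when it is safe,
    mirroring the detect-then-specialise approach discussed here."""
    return "single-item" if all_single_item(sequences) else "multi-item"
```

The check is a single linear pass over the input, so it is cheap relative to the mining itself; the cost noted in the related ticket comes only from taking the generic path when the specialised one doesn't apply.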
[jira] [Created] (SPARK-20179) Major improvements to Spark's Prefix span
Cyril de Vogelaere created SPARK-20179: -- Summary: Major improvements to Spark's Prefix span Key: SPARK-20179 URL: https://issues.apache.org/jira/browse/SPARK-20179 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Environment: All Reporter: Cyril de Vogelaere
The code I would like to push allows a major performance improvement through the use of a CP-based solver (namely OscaR, an open-source solver). As you can see in the log-scale graph reachable from the link below, performance is at worst roughly the same and at best up to 50x faster (FIFA dataset).
Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets : http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the two slen datasets used are the first two in the list of 9)
Performance was tested on the CECI servers, providing the driver with 10G of memory (more than needed) and running 4 slaves.
In addition to the performance improvements, I also added a number of new features :
- Unlimited max pattern length (with input 0)
- Min pattern length : any pattern below that length won't be output
- Max items per itemset : an itemset won't be grown beyond the given number of items, thus reducing the search space
- Head start : during the initial dataset cleaning, the frequent items were found and then discarded, which resulted in an inefficient first iteration of the genFreqPattern method. The algorithm now uses them if they are provided, and uses the empty pattern if they're not. A slight performance improvement was observed.
- Sub-problem limit : when the resulting item sequences can be very long and the user has a small number of very powerful machines, this parameter allows a quick switch to local execution, tremendously improving performance. Outside of those conditions, performance may be negatively affected.
- Item constraints : allow the user to specify constraints on the occurrences of an item (=, >, <, >=, <=, !=). For users looking for specific results, the search space can be greatly reduced, which also improves performance.
Please note that the aforementioned features didn't come into play when testing performance. The performance shown on the graph is 'merely' the result of replacing the local execution with a CP-based algorithm. (maxLocalProjDBSize was also kept at its default value (3200L), to make sure performance couldn't be artificially improved.)
Additional notes :
- The performance improvements are mostly seen for datasets where all itemsets are of size one, since that is a necessary condition for using the CP-based algorithm. But as you can see in the two Slen datasets, performance was also slightly improved for datasets with multiple items per itemset. The algorithm was built to detect when CP can be used, and to use it given the opportunity.
- The performance displayed here is the result of six months of work. Various other things were tried to improve performance, without as much success. I can thus say with some confidence that the performance attained here will be very hard to improve further.
- In case you want me to test performance on a specific dataset or to provide additional information, it would be my pleasure to do so :). Just hit me up at my email below :
Email : cyril.devogela...@gmail.com
- I am a newbie contributor to Spark, and am not familiar with the whole procedure at all. In case I did something incorrectly, I will fix it as soon as possible.
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950988#comment-15950988 ] Sean Owen commented on SPARK-20144: --- If the data were sorted, sorting would be pretty cheap, in general. Correctness has to take precedence in any event, if you're describing this as a blocker for you. I don't believe projection can change ordering, no. I am saying that I would not necessarily expect that to extend to external serialization. I don't see that being tabular or on HDFS matters. I think some serializations would naturally preserve order and others would not. I am still not 100% sure what the expected semantics of Parquet are here, but you have de facto evidence it is not guaranteed. > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2 > Reporter: Li Jin
[jira] [Comment Edited] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950979#comment-15950979 ] Li Jin edited comment on SPARK-20144 at 3/31/17 2:14 PM: - Thanks for getting back to me. Sorting in this case will just add extra cost to in our workflow and we are trying to avoid it in the first place. Because DataFrame presents the data in a tabular format, it is very surprising that the ordering of rows in the table changes after going through hdfs. In any other tabular format that I know of, ordering of rows is a property of the data and it is surprising that reading/writing changes properties of the data. This is also a bit scary because if ordering were not a property of a DataFrame, can things like cache or select("col") change ordering of rows in the future? was (Author: icexelloss): Thanks for getting back to me. Sorting in this case will just add extra cost to in our workflow and we are trying to avoid it in the first place. Because DataFrame presents the data in a tabular format, it is very surprising that the table changes after going through hdfs. In any other tabular format that I know of, ordering of rows is a property of the data and it is surprising that reading/writing changes properties of the data. This is also a bit scary because if ordering were not a property of a DataFrame, can things like cache or select("col") change ordering of rows in the future? > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1.
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950979#comment-15950979 ] Li Jin commented on SPARK-20144: Thanks for getting back to me. Sorting in this case will just add extra cost to our workflow and we are trying to avoid it in the first place. Because DataFrame presents the data in a tabular format, it is very surprising that the table changes after going through hdfs. In any other tabular format that I know of, ordering of rows is a property of the data and it is surprising that reading/writing changes properties of the data. This is also a bit scary because if ordering were not a property of a DataFrame, can things like cache or select("col") change ordering of rows in the future? > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was produced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1.
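[Editor's note] For readers hitting the same problem, the discussion above suggests an obvious workaround pattern: make the ordering an explicit column instead of relying on file layout. A minimal sketch only — the path and column names are invented, and this is not a fix proposed in the ticket:

```scala
// Hypothetical workaround sketch: persist the row order as data so it
// survives the write/read round trip, since spark.read.parquet may merge
// and reorder file splits.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().appName("ordering-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// Capture the current ordering before writing; the generated ids are
// monotonically increasing and unique, but not consecutive.
df.withColumn("row_id", monotonically_increasing_id())
  .write.mode("overwrite").parquet("/tmp/ordered.parquet")

// Restore the original order explicitly on read.
val restored = spark.read.parquet("/tmp/ordered.parquet")
  .orderBy("row_id")
  .drop("row_id")
```

This adds the sort on read that the commenter wanted to avoid, but it makes ordering a property of the data rather than of the file layout.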
[jira] [Updated] (SPARK-20177) Document about compression way has some little detail changes.
[ https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20177: --- Description: Document compression way little detail changes. 1.spark.eventLog.compress add 'Compression will use spark.io.compression.codec.' 2.spark.broadcast.compress add 'Compression will use spark.io.compression.codec.' 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' 4.spark.io.compression.codec add 'event log describe' eg Through the documents, I don't know what is compression mode about 'event log'. was: Document compression way little detail changes. 1.spark.eventLog.compress add 'Compression will use spark.io.compression.codec.' 2.spark.broadcast.compress add 'Compression will use spark.io.compression.codec.' 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' 4.spark.io.compression.codec add 'event log describe' > Document about compression way has some little detail changes. > -- > > Key: SPARK-20177 > URL: https://issues.apache.org/jira/browse/SPARK-20177 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > Document compression way little detail changes. > 1.spark.eventLog.compress add 'Compression will use > spark.io.compression.codec.' > 2.spark.broadcast.compress add 'Compression will use > spark.io.compression.codec.' > 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' > 4.spark.io.compression.codec add 'event log describe' > eg > Through the documents, I don't know what is compression mode about 'event > log'. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20177) Document about compression way has some little detail changes.
[ https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950939#comment-15950939 ] Apache Spark commented on SPARK-20177: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17498 > Document about compression way has some little detail changes. > -- > > Key: SPARK-20177 > URL: https://issues.apache.org/jira/browse/SPARK-20177 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > Document compression way little detail changes. > 1.spark.eventLog.compress add 'Compression will use > spark.io.compression.codec.' > 2.spark.broadcast.compress add 'Compression will use > spark.io.compression.codec.' > 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' > 4.spark.io.compression.codec add 'event log describe' -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950917#comment-15950917 ] Thomas Graves edited comment on SPARK-20178 at 3/31/17 1:53 PM: Overall what I would like to accomplish is not throwing away work and making the failure case very performant. More and more people are running spark on larger clusters, which means failures are going to occur more often. We need those failures to be handled as fast as possible. We need to be careful here and make sure we handle the node totally down case, the nodemanager totally down case, and the case where the nodemanager or node is just having an intermittent issue. Generally I see the last, where the issue is just intermittent, but some people recently have had more of the nodemanager totally down case, in which case you want to fail all maps on that node quickly. The decision on what to rerun is hard now because it could be very costly to rerun more, but at the same time it could be very costly to not rerun all immediately because you can fail all 4 stage attempts. This really depends on how long the maps and reduces run. A lot of discussion on https://github.com/apache/spark/pull/17088 related to that. - We should not kill the Reduce tasks on fetch failure. Leave the Reduce tasks running since they could have done useful work already, like fetching X number of map outputs. A reducer can simply fail that map output, which would cause the map to be rerun and only that specific map output would need to be refetched. This does require checking to make sure there are enough resources to run the map and, if not, possibly killing a reducer or getting more resources if dynamic allocation is enabled. - Improve logic around deciding which node is actually bad when you get a fetch failure. Was it really the node the reduce was on, or the node the map was on? You can do something here like the % of reducers that failed to fetch from a map output node. - We should only rerun the maps that are necessary. 
Other maps could have already been fetched (with bullet one) so no need to rerun those immediately. Since the reduce tasks keep running, other fetch failures can happen in parallel and that would just cause other maps to be rerun. At some point, based on bullet 2 above, we can decide the entire node is bad or invalidate all output on that node. Make sure to think about intermittent failures vs the shuffle handler being totally down and not coming back; use that in determining the logic. - Improve the blacklisting based on the above improvements - make sure to think about how this plays into the stage attempt max failures (4, now settable) - try to not waste resources, i.e. right now we can have 2 of the same reduce tasks running, which uses twice the resources, and there are a bunch of different conditions that can occur as to whether this work is actually useful. Question: - should we consider having it fetch all map output from a host at once (rather than per executor)? This could improve fetching times (but would have to test) as well as fetch failure handling. This could cause it to fail more maps, which is somewhat contradictory to bullet 3 above; need to think about this more. - Do we need a pluggable interface, or how do we not destabilize the current scheduler? Bonus or future: - Decision on when and how many maps to rerun is a cost-based estimate. If maps only take a few seconds to run, we could rerun all maps on the host immediately - option to prestart reduce tasks so that they can start fetching while the last few maps are failing (if you have long-tail maps) was (Author: tgraves): Overall what I would like to accomplish is not throwing away work and making the failure case very performant. More and more people are running spark on larger clusters, which means failures are going to occur more often. We need those failures to be handled as fast as possible. 
We need to be careful here and make sure we handle the node totally down case, the nodemanager totally down, and the nodemanager or node is just having intermittent issue. Generally I see the last where the issue is just intermittent but some people recently have had more of the nodemanager totally down case in which case you want to fail all maps on that node quickly. The decision on what to rerun is hard now because it could be very costly to rerun more, but at the same time it could be very costly to not rerun all immediately because you can fail all 4 stage attempts. This really depends on how long the maps and reduces run. A lot of discussion on https://github.com/apache/spark/pull/17088 related to that. - We should not kill the Reduce tasks on fetch failure. Leave the Reduce tasks running since it could have done useful work already like fetching X number of map outputs. It can simply fail that map output which would cause the map to be rerun and only that specific map output would need to be
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950917#comment-15950917 ] Thomas Graves commented on SPARK-20178: --- Overall what I would like to accomplish is not throwing away work and making the failure case very performant. More and more people are running spark on larger clusters, which means failures are going to occur more often. We need those failures to be handled as fast as possible. We need to be careful here and make sure we handle the node totally down case, the nodemanager totally down case, and the case where the nodemanager or node is just having an intermittent issue. Generally I see the last, where the issue is just intermittent, but some people recently have had more of the nodemanager totally down case, in which case you want to fail all maps on that node quickly. The decision on what to rerun is hard now because it could be very costly to rerun more, but at the same time it could be very costly to not rerun all immediately because you can fail all 4 stage attempts. This really depends on how long the maps and reduces run. A lot of discussion on https://github.com/apache/spark/pull/17088 related to that. - We should not kill the Reduce tasks on fetch failure. Leave the Reduce tasks running since they could have done useful work already, like fetching X number of map outputs. A reducer can simply fail that map output, which would cause the map to be rerun and only that specific map output would need to be refetched. This does require checking to make sure there are enough resources to run the map and, if not, possibly killing a reducer or getting more resources if dynamic allocation is enabled. - Improve logic around deciding which node is actually bad when you get a fetch failure. Was it really the node the reduce was on, or the node the map was on? You can do something here like the % of reducers that failed to fetch from a map output node. 
- We should only rerun the maps that failed (or have some logic around how to make this decision), other maps could have already been fetched (with bullet one) so no need to rerun if all reducers already fetched. Since the reduce tasks keep running, other fetch failures can happen in parallel and that would just cause other maps to be rerun. At some point, based on bullet 2 above, we can decide the entire node is bad. - Improve the blacklisting based on the above improvements - make sure to think about how this plays into the stage attempt max failures (4, now settable) - try to not waste resources, i.e. right now we can have 2 of the same reduce tasks running, which uses twice the resources, and there are a bunch of different conditions that can occur as to whether this work is actually useful. Question: - should we consider having it fetch all map output from a host at once (rather than per executor)? This could improve fetching times (but would have to test) as well as fetch failure handling. This could cause it to fail more maps, which is somewhat contradictory to bullet 3 above; need to think about this more. - Do we need a pluggable interface, or how do we not destabilize the current scheduler? Bonus or future: - Decision on when and how many maps to rerun is a cost-based estimate. If maps only take a few seconds to run, we could rerun all maps on the host immediately - option to prestart reduce tasks so that they can start fetching while the last few maps are failing (if you have long-tail maps) > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are 4 jira currently related to this. > We should try to get a list of things we want to improve and come up with one > cohesive design. 
> SPARK-20163, SPARK-20091, SPARK-14649 , and SPARK-19753 > I will put my initial thoughts in a follow on comment.
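[Editor's note] The "which node is actually bad" bullet above could be made concrete with a simple threshold heuristic. The following is purely a hypothetical sketch; the type, names, and threshold are invented and do not correspond to Spark's actual scheduler code:

```scala
// Blame a map-output host only once a meaningful fraction of the reducers
// have failed to fetch from it, rather than on the first fetch failure.
case class FetchFailureStats(failedReducers: Set[Int], totalReducers: Int)

def isMapHostBad(stats: FetchFailureStats, threshold: Double = 0.5): Boolean =
  stats.totalReducers > 0 &&
    stats.failedReducers.size.toDouble / stats.totalReducers >= threshold
```

A real implementation would also have to distinguish intermittent failures from a shuffle service that is down for good, as the comment points out.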
[jira] [Closed] (SPARK-19443) The function to generate constraints takes too long when the query plan grows continuously
[ https://issues.apache.org/jira/browse/SPARK-19443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-19443. --- Resolution: Won't Fix > The function to generate constraints takes too long when the query plan grows > continuously > -- > > Key: SPARK-19443 > URL: https://issues.apache.org/jira/browse/SPARK-19443 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > > This issue is originally reported and discussed at > http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html > When running a ML `Pipeline` with many stages, during the iterative updating to > `Dataset` , it is observed that it takes a longer time to finish the fit and > transform as the query plan grows continuously. > Specifically, the time spent on preparing the optimized plan in the current branch > (74294 ms) is much higher than in 1.6 (292 ms). Actually, the time is spent > mostly on generating the query plan's constraints during a few optimization rules.
[jira] [Closed] (SPARK-19665) Improve constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-19665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-19665. --- Resolution: Won't Fix > Improve constraint propagation > -- > > Key: SPARK-19665 > URL: https://issues.apache.org/jira/browse/SPARK-19665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > > If there are aliased expressions in the projection, we propagate constraints > by completely expanding the original constraints with aliases. > This expanding costs much computation time when the number of aliases > increases. > Another issue is we actually don't need the additional constraints most of the > time. For example, if there is a constraint "a > b", and "a" is aliased to > "c" and "d". When we use this constraint in filtering, we don't need all > constraints "a > b", "c > b", "d > b". We only need "a > b" because if it is > false, it is guaranteed that all other constraints are false too. > Fully expanding all constraints all the time makes iterative ML algorithms > where a ML pipeline has many stages run very slow.
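[Editor's note] For illustration, a query of the shape the description above is about, with invented column names:

```scala
// The filter yields the constraint a > b; projecting "a" under two extra
// aliases is what lets constraint propagation also derive c > b and d > b,
// even though (as the issue argues) a > b alone would suffice for filtering.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("constraint-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val expanded = Seq((3, 1)).toDF("a", "b")
  .filter($"a" > $"b")
  .select($"a", $"b", $"a".as("c"), $"a".as("d"))
```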
[jira] [Created] (SPARK-20178) Improve Scheduler fetch failures
Thomas Graves created SPARK-20178: - Summary: Improve Scheduler fetch failures Key: SPARK-20178 URL: https://issues.apache.org/jira/browse/SPARK-20178 Project: Spark Issue Type: Epic Components: Scheduler Affects Versions: 2.1.0 Reporter: Thomas Graves We have been having a lot of discussions around improving the handling of fetch failures. There are 4 jira currently related to this. We should try to get a list of things we want to improve and come up with one cohesive design. SPARK-20163, SPARK-20091, SPARK-14649 , and SPARK-19753 I will put my initial thoughts in a follow on comment.
[jira] [Assigned] (SPARK-20177) Document about compression way has some little detail changes.
[ https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20177: Assignee: Apache Spark > Document about compression way has some little detail changes. > -- > > Key: SPARK-20177 > URL: https://issues.apache.org/jira/browse/SPARK-20177 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > Document compression way little detail changes. > 1.spark.eventLog.compress add 'Compression will use > spark.io.compression.codec.' > 2.spark.broadcast.compress add 'Compression will use > spark.io.compression.codec.' > 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' > 4.spark.io.compression.codec add 'event log describe' -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20177) Document about compression way has some little detail changes.
[ https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20177: Assignee: (was: Apache Spark) > Document about compression way has some little detail changes. > -- > > Key: SPARK-20177 > URL: https://issues.apache.org/jira/browse/SPARK-20177 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > Document compression way little detail changes. > 1.spark.eventLog.compress add 'Compression will use > spark.io.compression.codec.' > 2.spark.broadcast.compress add 'Compression will use > spark.io.compression.codec.' > 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' > 4.spark.io.compression.codec add 'event log describe' -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20177) Document about compression way has some little detail changes.
[ https://issues.apache.org/jira/browse/SPARK-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950826#comment-15950826 ] Apache Spark commented on SPARK-20177: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17497 > Document about compression way has some little detail changes. > -- > > Key: SPARK-20177 > URL: https://issues.apache.org/jira/browse/SPARK-20177 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > Document compression way little detail changes. > 1.spark.eventLog.compress add 'Compression will use > spark.io.compression.codec.' > 2.spark.broadcast.compress add 'Compression will use > spark.io.compression.codec.' > 3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.' > 4.spark.io.compression.codec add 'event log describe' -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20177) Document about compression way has some little detail changes.
guoxiaolongzte created SPARK-20177: -- Summary: Document about compression way has some little detail changes. Key: SPARK-20177 URL: https://issues.apache.org/jira/browse/SPARK-20177 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.1.0 Reporter: guoxiaolongzte Priority: Minor Document compression way little detail changes. 1.spark.eventLog.compress add 'Compression will use spark.io.compression.codec.' 2.spark.broadcast.compress add 'Compression will use spark.io.compression.codec.' 3.spark.rdd.compress add 'Compression will use spark.io.compression.codec.' 4.spark.io.compression.codec add 'event log describe'
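[Editor's note] The four settings the ticket above covers can be read together as a spark-defaults.conf fragment; the values here are only illustrative (lz4 is the default codec in Spark 2.1):

```
# Each of the following is compressed using the codec chosen by
# spark.io.compression.codec, which is the point the documentation change adds.
spark.eventLog.compress     true
spark.broadcast.compress    true
spark.rdd.compress          true
# Also used for event log compression.
spark.io.compression.codec  lz4
```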
[jira] [Comment Edited] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950778#comment-15950778 ] Sunil Rangwani edited comment on SPARK-14492 at 3/31/17 12:27 PM: -- My problem exactly was a) Interacting with Hive metastore of an older version. I set it up with the various spark.sql.hive.metastore.* config options but that didn't work. I had to do a messy upgrade of the external hive metastore database and service to get it to work. was (Author: sunil.rangwani): My problem exactly was a) Interacting with Hive metastore of an older version. I set it up with the various config options spark.sql.hive.metastore.* config options but that didn't work. I had to do a messy upgrade of the external hive metastore database and service to get it to work. > Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not > backwards compatible with earlier version > --- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code}
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950778#comment-15950778 ] Sunil Rangwani commented on SPARK-14492: My problem exactly was a) Interacting with Hive metastore of an older version. I set it up with the various spark.sql.hive.metastore.* config options but that didn't work. I had to do a messy upgrade of the external hive metastore database and service to get it to work. > Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not > backwards compatible with earlier version > --- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code}
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950768#comment-15950768 ] Sean Owen commented on SPARK-14492: --- You are still describing two different things I think: a) interacting with Hive metastore X and b) building Spark with Hive X. a) should work as documented. What you describe in this JIRA is b) though. You do not need to, and cannot in fact, build Spark versus older Hive. > Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not > backwards compatible with earlier version > --- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
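For case (a) in Sean Owen's comment, interacting with an older Hive metastore from a stock Spark build, the documented route is the `spark.sql.hive.metastore.*` options rather than rebuilding Spark. A minimal sketch; the version value below is illustrative, and the supported range depends on the Spark release:

```
# spark-defaults.conf -- point Spark's metastore client at an older Hive
# (0.14.0 is a placeholder; check the Spark SQL docs for supported versions)
spark.sql.hive.metastore.version   0.14.0
spark.sql.hive.metastore.jars      maven
```

With these set, Spark itself stays built against its bundled Hive 1.2 classes and only the metastore client is version-matched, which is why rebuilding against an older Hive is neither needed nor supported.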
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950760#comment-15950760 ] Sunil Rangwani commented on SPARK-14492: [~sowen] Can you please explain why this is not a problem? Interacting with a different version of the Hive metastore does not work as described in the documentation. I have met other people with the same use case: they have legacy data in Hive and want to use Spark to interact with it. > Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not > backwards compatible with earlier version > --- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18936) Infrastructure for session local timezone support
[ https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950748#comment-15950748 ] Navya Krishnappa edited comment on SPARK-18936 at 3/31/17 11:51 AM: I think this fix lets us set the time zone in the Spark configuration. If so, can we set "UTC" as the time zone? Please let me know if I have misunderstood the document. was (Author: navya krishnappa): I think this fix helps us to set the time zone in the spark configurations. If it's so Can we set "UTC" time zone?? And let me know if I misunderstood the document. > Infrastructure for session local timezone support > - > > Key: SPARK-18936 > URL: https://issues.apache.org/jira/browse/SPARK-18936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18936) Infrastructure for session local timezone support
[ https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950748#comment-15950748 ] Navya Krishnappa commented on SPARK-18936: -- I think this fix lets us set the time zone in the Spark configuration. If so, can we set the "UTC" time zone? Please let me know if I have misunderstood the document. > Infrastructure for session local timezone support > - > > Key: SPARK-18936 > URL: https://issues.apache.org/jira/browse/SPARK-18936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
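As a sketch of what this infrastructure enables: assuming the configuration key introduced by this work is `spark.sql.session.timeZone` (the name used in the implementing change; verify against the 2.2.0 docs), UTC could be set like any other SQL conf once 2.2.0 is available:

```
# spark-defaults.conf (or spark.conf.set(...) at runtime), Spark 2.2.0+
spark.sql.session.timeZone   UTC
```

On earlier releases the session time zone falls back to the JVM default, so `-Duser.timezone=UTC` on driver and executors is the usual workaround.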
[jira] [Commented] (SPARK-20152) Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ"
[ https://issues.apache.org/jira/browse/SPARK-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950745#comment-15950745 ] Navya Krishnappa commented on SPARK-20152: -- [~srowen] & [~hyukjin.kwon] Thank you for your comments. > Time zone is not respected while parsing csv for timeStampFormat > "MM-dd-'T'HH:mm:ss.SSSZZ" > -- > > Key: SPARK-20152 > URL: https://issues.apache.org/jira/browse/SPARK-20152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Navya Krishnappa > > When reading the below mentioned time value by specifying the > "timestampFormat": "MM-dd-'T'HH:mm:ss.SSSZZ", time zone is ignored. > Source File: > TimeColumn > 03-21-2017T03:30:02Z > Source code1: > Dataset dataset = getSqlContext().read() > .option(PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(DELIMITER, ",") > .option(QUOTE, "\"") > .option(ESCAPE, "\\") > .option("timestampFormat" , "MM-dd-'T'HH:mm:ss.SSSZZ") > .option(MODE, Mode.PERMISSIVE) > .csv(sourceFile); > Result: TimeColumn [ StringType] and value is "03-21-2017T03:30:02Z", but > expected result is TimeCoumn should be of "TimestampType" and should > consider time zone for manipulation > Source code2: > Dataset dataset = getSqlContext().read() > .option(PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(DELIMITER, ",") > .option(QUOTE, "\"") > .option(ESCAPE, "\\") > .option("timestampFormat" , "MM-dd-'T'HH:mm:ss") > .option(MODE, Mode.PERMISSIVE) > .csv(sourceFile); > Result: TimeColumn [ TimestampType] and value is "2017-03-21 03:30:02.0", but > expected result is TimeCoumn should consider time zone for manipulation -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
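Spark 2.1's CSV reader parses `timestampFormat` with Apache Commons `FastDateFormat`, so the exact behavior may differ slightly, but the core mismatch can be reproduced with plain `java.text.SimpleDateFormat`: the reported pattern `MM-dd-yyyy'T'HH:mm:ss.SSSZZ` demands fractional seconds and an RFC-822 style offset, so a value like `03-21-2017T03:30:02Z` does not match at all, which is consistent with schema inference falling back to `StringType`. An ISO-8601 `XXX` specifier without `.SSS` does accept the trailing `Z` (a sketch, not the Spark code path itself):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampParseDemo {
    public static void main(String[] args) throws ParseException {
        String value = "03-21-2017T03:30:02Z";

        // Pattern from the report: requires ".SSS" millis and an RFC-822
        // zone like "+0000", so the bare "Z" suffix fails to parse.
        SimpleDateFormat reported = new SimpleDateFormat("MM-dd-yyyy'T'HH:mm:ss.SSSZZ");
        boolean reportedParses;
        try {
            reported.parse(value);
            reportedParses = true;
        } catch (ParseException e) {
            reportedParses = false;
        }
        System.out.println("reported pattern parses: " + reportedParses);

        // ISO-8601 "XXX" accepts a literal "Z" as UTC (Java 7+), so the
        // zone information is respected in the parsed instant.
        SimpleDateFormat iso = new SimpleDateFormat("MM-dd-yyyy'T'HH:mm:ssXXX");
        Date d = iso.parse(value);
        System.out.println("epoch seconds: " + d.getTime() / 1000);
    }
}
```

So a pattern along the lines of `MM-dd-yyyy'T'HH:mm:ssXXX` is worth trying for this data; whether `FastDateFormat` in 2.1 supports `XXX` identically is an assumption to verify.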
[jira] [Created] (SPARK-20176) Spark Dataframe UDAF issue
Dinesh Man Amatya created SPARK-20176: - Summary: Spark Dataframe UDAF issue Key: SPARK-20176 URL: https://issues.apache.org/jira/browse/SPARK-20176 Project: Spark Issue Type: IT Help Components: Spark Core Affects Versions: 2.0.2 Reporter: Dinesh Man Amatya Getting following error in custom UDAF Error while decoding: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private MutableRow mutableRow; /* 009 */ private Object[] values; /* 010 */ private Object[] values1; /* 011 */ private org.apache.spark.sql.types.StructType schema; /* 012 */ private org.apache.spark.sql.types.StructType schema1; /* 013 */ /* 014 */ /* 015 */ public SpecificSafeProjection(Object[] references) { /* 016 */ this.references = references; /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; /* 018 */ /* 019 */ /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) references[0]; /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) references[1]; /* 022 */ } /* 023 */ /* 024 */ public java.lang.Object apply(java.lang.Object _i) { /* 025 */ InternalRow i = (InternalRow) _i; /* 026 */ /* 027 */ values = new Object[2]; /* 028 */ /* 029 */ boolean isNull2 = i.isNullAt(0); /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); /* 031 */ /* 032 */ boolean isNull1 = isNull2; /* 033 */ final java.lang.String value1 = isNull1 ? 
null : (java.lang.String) value2.toString(); /* 034 */ isNull1 = value1 == null; /* 035 */ if (isNull1) { /* 036 */ values[0] = null; /* 037 */ } else { /* 038 */ values[0] = value1; /* 039 */ } /* 040 */ /* 041 */ boolean isNull5 = i.isNullAt(1); /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); /* 043 */ boolean isNull3 = false; /* 044 */ org.apache.spark.sql.Row value3 = null; /* 045 */ if (!false && isNull5) { /* 046 */ /* 047 */ final org.apache.spark.sql.Row value6 = null; /* 048 */ isNull3 = true; /* 049 */ value3 = value6; /* 050 */ } else { /* 051 */ /* 052 */ values1 = new Object[2]; /* 053 */ /* 054 */ boolean isNull10 = i.isNullAt(1); /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); /* 056 */ /* 057 */ boolean isNull9 = isNull10 || false; /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) value10.isNullAt(0); /* 059 */ boolean isNull8 = false; /* 060 */ double value8 = -1.0; /* 061 */ if (!isNull9 && value9) { /* 062 */ /* 063 */ final double value12 = -1.0; /* 064 */ isNull8 = true; /* 065 */ value8 = value12; /* 066 */ } else { /* 067 */ /* 068 */ boolean isNull14 = i.isNullAt(1); /* 069 */ InternalRow value14 = isNull14 ? null : (i.getStruct(1, 2)); /* 070 */ boolean isNull13 = isNull14; /* 071 */ double value13 = -1.0; /* 072 */ /* 073 */ if (!isNull14) { /* 074 */ /* 075 */ if (value14.isNullAt(0)) { /* 076 */ isNull13 = true; /* 077 */ } else { /* 078 */ value13 = value14.getDouble(0); /* 079 */ } /* 080 */ /* 081 */ } /* 082 */ isNull8 = isNull13; /* 083 */ value8 = value13; /* 084 */ } /* 085 */ if (isNull8) { /* 086 */ values1[0] = null; /* 087 */ } else { /* 088 */ values1[0] = value8; /* 089 */ } /* 090 */ /* 091 */ boolean isNull17 = i.isNullAt(1); /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); /* 093 */ /* 094 */ boolean isNull16 = isNull17 || false; /* 095 */ final boolean value16 = isNull16 ? 
false : (Boolean) value17.isNullAt(1); /* 096 */ boolean isNull15 = false; /* 097 */ double value15 = -1.0; /* 098 */ if (!isNull16 && value16) { /* 099 */ /* 100 */ final double value19 = -1.0; /* 101 */ isNull15 = true; /* 102 */ value15 = value19; /* 103 */ } else { /* 104 */ /* 105 */ boolean isNull21 = i.isNullAt(1); /* 106 */ InternalRow value21 = isNull21 ? null : (i.getStruct(1, 2)); /* 107 */ boolean isNull20 = isNull21; /* 108 */ double value20 = -1.0; /* 109 */ /* 110 */ if (!isNull21) { /* 111 */ /* 112 */ if (value21.isNullAt(1)) { /* 113 */
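The compile failure at Line 58 of the generated code comes from Janino, the runtime compiler Spark uses for generated code, which unlike javac does not auto-unbox; a conditional mixing a primitive `boolean` with a boxed `java.lang.Boolean`, as in `isNull9 ? false : (Boolean) value10.isNullAt(0)`, is therefore rejected. A minimal, non-Spark sketch of the distinction (variable names are hypothetical):

```java
public class BoxingDemo {
    public static void main(String[] args) {
        Boolean boxed = Boolean.TRUE;  // boxed, like "(Boolean) value10.isNullAt(0)"
        boolean isNull = false;

        // javac accepts this by auto-unboxing `boxed`; Janino rejects the
        // equivalent generated expression with "Incompatible expression
        // types \"boolean\" and \"java.lang.Boolean\"".
        boolean viaAutobox = isNull ? false : boxed;

        // Explicit unboxing avoids the mismatch under either compiler.
        boolean viaExplicit = isNull ? false : boxed.booleanValue();

        System.out.println(viaAutobox + " " + viaExplicit);
    }
}
```

This points at the UDAF's schema or its codegen path producing a boxed Boolean where a primitive is expected, rather than at the user's aggregation logic itself.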
[jira] [Commented] (SPARK-20173) Throw NullPointerException when HiveThriftServer2 is shutdown
[ https://issues.apache.org/jira/browse/SPARK-20173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950683#comment-15950683 ] Apache Spark commented on SPARK-20173: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/17496 > Throw NullPointerException when HiveThriftServer2 is shutdown > - > > Key: SPARK-20173 > URL: https://issues.apache.org/jira/browse/SPARK-20173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zuotingbing > > Throw NullPointerException when HiveThriftServer2 is shutdown: > > 2017-03-30 11:52:56,355 ERROR Utils: Uncaught exception in thread Thread-2 > java.lang.NullPointerException > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$$anonfun$main$1.apply$mcV$sp(HiveThriftServer2.scala:85) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:215) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:187) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:187) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:187) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:177) > at > 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > 2017-03-30 11:52:56,357 INFO ShutdownHookManager: Shutdown hook called -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished
[ https://issues.apache.org/jira/browse/SPARK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950640#comment-15950640 ] Sean Owen commented on SPARK-20139: --- So is the lesson here that the driver can't keep up at this scale with all of the event messages -- is it just cosmetic? > Spark UI reports partial success for completed stage while log shows all > tasks are finished > --- > > Key: SPARK-20139 > URL: https://issues.apache.org/jira/browse/SPARK-20139 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Etti Gur > Attachments: screenshot-1.png > > > Spark UI reports partial success for completed stage while log shows all > tasks are finished - i.e.: > We have a stage that is presented under completed stages on spark UI, > but the successful tasks are shown like so: (146372/524964) not as you'd > expect (524964/524964) > Looking at the application master log shows all tasks in that stage are > successful: > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 > (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) > (524963/524964) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 > (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) > (20234/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 > (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) > (20235/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 > (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) > (20236/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 > (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) > (20237/20262) > 17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 > (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) > (20238/20262) > 17/03/29 
09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 > (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) > (524964/524964) > Also in the log we get an error: > 17/03/29 08:24:16 ERROR LiveListenerBus: Dropping SparkListenerEvent because > no remaining room in event queue. This likely means one of the SparkListeners > is too slow and cannot keep up with the rate at which tasks are being started > by the scheduler. > This looks like the stage is indeed completed with all its tasks but UI shows > like not all tasks really finished. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
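The `LiveListenerBus` error in the log supports that reading: once the listener event queue overflows, dropped task-end events leave the UI's counters permanently behind, while the scheduler's own accounting (and the job result) remains correct, so the mismatch is cosmetic. A hedged mitigation is to enlarge the queue; the key name below is the one used as of Spark 2.1 (later releases renamed it), and the value is illustrative:

```
# spark-defaults.conf -- enlarge the listener event queue (default 10000 in 2.1)
spark.scheduler.listenerbus.eventqueue.size   100000
```

A larger queue trades driver memory for fewer dropped events; at half a million tasks some drops may still occur.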
[jira] [Resolved] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14492. --- Resolution: Not A Problem > Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not > backwards compatible with earlier version > --- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.
[ https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolong updated SPARK-19862: Comment: was deleted (was: @srowen In spark2.1.0,"tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName has been deleted,but you didn't agree with my issue SPARK-19862.why?) > In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. > - > > Key: SPARK-19862 > URL: https://issues.apache.org/jira/browse/SPARK-19862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: guoxiaolong >Priority: Trivial > > "tungsten-sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be > deleted. Because it is the same of "sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work
[ https://issues.apache.org/jira/browse/SPARK-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19690: -- Target Version/s: 2.2.0 (was: 2.1.1, 2.2.0) > Join a streaming DataFrame with a batch DataFrame may not work > -- > > Key: SPARK-19690 > URL: https://issues.apache.org/jira/browse/SPARK-19690 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.3, 2.1.0, 2.1.1 >Reporter: Shixiong Zhu >Priority: Critical > > When joining a streaming DataFrame with a batch DataFrame, if the batch > DataFrame has an aggregation, it will be converted to a streaming physical > aggregation. Then the query will crash. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20167) In SqlBase.g4, some of the comments are not correct.
[ https://issues.apache.org/jira/browse/SPARK-20167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20167. --- Resolution: Not A Problem > In SqlBase.g4, some of the comments are not correct. > -- > > Key: SPARK-20167 > URL: https://issues.apache.org/jira/browse/SPARK-20167 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > In SqlBase.g4, some of the comments are not correct. > e.g.: > | DROP TABLE (IF EXISTS)? tableIdentifier PURGE? #dropTable > | DROP VIEW (IF EXISTS)? tableIdentifier > #dropTable > the label for ‘DROP VIEW (IF EXISTS)? tableIdentifier’ should be #dropView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950627#comment-15950627 ] Sean Owen commented on SPARK-20144: --- If you need a particular ordering, I think you need to sort. I am not sure ordering is particularly guaranteed in the format or the reading of it. > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20175: Assignee: Apache Spark > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20175: Assignee: (was: Apache Spark) > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950624#comment-15950624 ] Apache Spark commented on SPARK-20175: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17491 > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
Liang-Chi Hsieh created SPARK-20175: --- Summary: Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference Key: SPARK-20175 URL: https://issues.apache.org/jira/browse/SPARK-20175 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Liang-Chi Hsieh Similar to ListQuery, Exists should not be evaluated in Join operator too. Otherwise, a query like following will fail: sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR l.a = r.c)") For the Exists subquery without correlated reference, this patch converts it to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20173) Throw NullPointerException when HiveThriftServer2 is shutdown
[ https://issues.apache.org/jira/browse/SPARK-20173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950619#comment-15950619 ] Xiaochen Ouyang commented on SPARK-20173: - +1

> Throw NullPointerException when HiveThriftServer2 is shutdown
>
> Key: SPARK-20173
> URL: https://issues.apache.org/jira/browse/SPARK-20173
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: zuotingbing
>
> Throw NullPointerException when HiveThriftServer2 is shutdown:
>
> 2017-03-30 11:52:56,355 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
> at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$$anonfun$main$1.apply$mcV$sp(HiveThriftServer2.scala:85)
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:215)
> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:187)
> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187)
> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:187)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:187)
> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187)
> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:187)
> at scala.util.Try$.apply(Try.scala:192)
> at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:187)
> at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:177)
> at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 2017-03-30 11:52:56,357 INFO ShutdownHookManager: Shutdown hook called

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
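The NPE above comes from a shutdown hook dereferencing state that may still be null when shutdown races initialization. A minimal sketch of the usual defensive pattern follows; the names (`server`, `safeClose`) are illustrative, not the actual HiveThriftServer2 fix:

```scala
// Sketch of a null-guarded shutdown hook. `server` stands in for whatever
// field the hook dereferences; this is a hypothetical illustration.
object SafeShutdownSketch {
  @volatile private var server: AutoCloseable = _

  // Returns true if there was a live resource to close, false otherwise.
  def safeClose(resource: AutoCloseable): Boolean =
    if (resource != null) { resource.close(); true } else false

  def installHook(): Unit = {
    sys.addShutdownHook {
      // Guard against shutdown running before the server was ever assigned.
      safeClose(server)
    }
  }
}
```

With this guard, a hook that fires before initialization completes simply does nothing instead of throwing.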
[jira] [Assigned] (SPARK-20172) Event log without read permission should be filtered out before actually reading it
[ https://issues.apache.org/jira/browse/SPARK-20172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20172: Assignee: (was: Apache Spark)

> Event log without read permission should be filtered out before actually reading it
>
> Key: SPARK-20172
> URL: https://issues.apache.org/jira/browse/SPARK-20172
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Saisai Shao
> Priority: Minor
>
> In the current Spark HistoryServer, we expect to check file permissions when listing files and to filter out files without read permission. This does not work because the access permission is never actually checked; instead the check is deferred until the files are read, which is unnecessary and causes the exception to be printed every 10 seconds by default. To avoid this problem, an access check should be added to the file-listing logic.
[jira] [Assigned] (SPARK-20172) Event log without read permission should be filtered out before actually reading it
[ https://issues.apache.org/jira/browse/SPARK-20172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20172: Assignee: Apache Spark

> Event log without read permission should be filtered out before actually reading it
>
> Key: SPARK-20172
> URL: https://issues.apache.org/jira/browse/SPARK-20172
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Saisai Shao
> Assignee: Apache Spark
> Priority: Minor
>
> In the current Spark HistoryServer, we expect to check file permissions when listing files and to filter out files without read permission. This does not work because the access permission is never actually checked; instead the check is deferred until the files are read, which is unnecessary and causes the exception to be printed every 10 seconds by default. To avoid this problem, an access check should be added to the file-listing logic.
[jira] [Commented] (SPARK-20172) Event log without read permission should be filtered out before actually reading it
[ https://issues.apache.org/jira/browse/SPARK-20172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950616#comment-15950616 ] Apache Spark commented on SPARK-20172: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/17495

> Event log without read permission should be filtered out before actually reading it
>
> Key: SPARK-20172
> URL: https://issues.apache.org/jira/browse/SPARK-20172
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Saisai Shao
> Priority: Minor
>
> In the current Spark HistoryServer, we expect to check file permissions when listing files and to filter out files without read permission. This does not work because the access permission is never actually checked; instead the check is deferred until the files are read, which is unnecessary and causes the exception to be printed every 10 seconds by default. To avoid this problem, an access check should be added to the file-listing logic.
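The fix described in SPARK-20172 amounts to applying a readability filter at listing time rather than failing at read time. A minimal, hypothetical sketch of that idea using `java.nio` (the real HistoryServer lists files through Hadoop's FileSystem API, so this is only an illustration of the filtering step):

```scala
import java.nio.file.{Files, Path}

// Keep only entries the current process can actually read, so unreadable
// event logs are dropped when the log directory is listed, instead of
// throwing on every polling cycle (every 10 seconds by default).
def readableLogs(entries: Seq[Path]): Seq[Path] =
  entries.filter(p => Files.isReadable(p))
```

`Files.isReadable` also returns false for paths that no longer exist, so logs deleted between listing and reading are filtered out as well.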