[jira] [Commented] (SPARK-12278) Move the shuffle related test case from Yarn module to Core module
[ https://issues.apache.org/jira/browse/SPARK-12278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675998#comment-15675998 ] Ferdinand Xu commented on SPARK-12278: -- Thanks [~srowen] for pointing this out. The main consideration here is that shuffle file encryption should support all modes, not only Yarn mode. We can lower the priority since now it's not supporting Yarn mode. > Move the shuffle related test case from Yarn module to Core module > -- > > Key: SPARK-12278 > URL: https://issues.apache.org/jira/browse/SPARK-12278 > Project: Spark > Issue Type: Test > Components: Shuffle >Reporter: Ferdinand Xu > > The test in _YarnShuffleEncryptionSuite_ requires _YarnSparkHadoopUtil_, which > is in the Yarn module, so we have to leave it in the Yarn module instead of the Core > module. After SPARK-11807 is resolved, we will move the test back to the Core > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675985#comment-15675985 ] Jiang Xingbo commented on SPARK-17932: -- I’m working on this, thanks! > Failed to run SQL "show table extended like table_name" in Spark2.0.0 > --- > > Key: SPARK-17932 > URL: https://issues.apache.org/jira/browse/SPARK-17932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang > > SQL "show table extended like table_name " doesn't work in spark 2.0.0 > that works in spark1.5.2 > Error: org.apache.spark.sql.catalyst.parser.ParseException: > missing 'FUNCTIONS' at 'extended'(line 1, pos 11) > == SQL == > show table extended like test > ---^^^ (state=,code=0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
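Editor's note: as a stopgap only (not a fix for the parser), comparable table metadata can usually be obtained in Spark 2.0.x through syntax the parser does accept. The sketch below is an assumption about an acceptable workaround, not part of the report; `test` is the table name from the reported query.

{code}
// Hedged workaround sketch for Spark 2.0.x while "SHOW TABLE EXTENDED ... LIKE" is unsupported.
// `test` is the table name taken from the report above.
spark.sql("SHOW TABLES LIKE 'test'").show()                  // list matching tables
spark.sql("DESCRIBE EXTENDED test").show(truncate = false)   // detailed table metadata
{code}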
[jira] [Assigned] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
[ https://issues.apache.org/jira/browse/SPARK-18500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18500: Assignee: Apache Spark > Make GenericStrategy be able to prune plans by itself after placeholders are > replaced. > -- > > Key: SPARK-18500 > URL: https://issues.apache.org/jira/browse/SPARK-18500 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin >Assignee: Apache Spark > > Add a functionality to {{GenericStrategy}} to be able to prune bad physical > plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
[ https://issues.apache.org/jira/browse/SPARK-18500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675978#comment-15675978 ] Apache Spark commented on SPARK-18500: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/15927 > Make GenericStrategy be able to prune plans by itself after placeholders are > replaced. > -- > > Key: SPARK-18500 > URL: https://issues.apache.org/jira/browse/SPARK-18500 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Add a functionality to {{GenericStrategy}} to be able to prune bad physical > plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
[ https://issues.apache.org/jira/browse/SPARK-18500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18500: Assignee: (was: Apache Spark) > Make GenericStrategy be able to prune plans by itself after placeholders are > replaced. > -- > > Key: SPARK-18500 > URL: https://issues.apache.org/jira/browse/SPARK-18500 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Add a functionality to {{GenericStrategy}} to be able to prune bad physical > plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675966#comment-15675966 ] Nathan Howell commented on SPARK-18352: --- Sounds good to me. I have an implementation that's passing basic tests but needs to be cleaned up a bit. I'll get a pull request up in the next few days. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
Takuya Ueshin created SPARK-18500: - Summary: Make GenericStrategy be able to prune plans by itself after placeholders are replaced. Key: SPARK-18500 URL: https://issues.apache.org/jira/browse/SPARK-18500 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Add a functionality to {{GenericStrategy}} to be able to prune bad physical plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675770#comment-15675770 ] Seth Hendrickson commented on SPARK-9478: - I'm going to work on submitting a PR for adding sample weights for 2.2. That pr is for adding class weights, which I think we decided against. > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-9478: Summary: Add sample weights to Random Forest (was: Add class weights to Random Forest) > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18499) Add back support for custom Spark SQL dialects
[ https://issues.apache.org/jira/browse/SPARK-18499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675714#comment-15675714 ] Andrew Ash commented on SPARK-18499: Specifically what I'm most interested in is a strict ANSI SQL dialect, not bending Spark SQL to support a proprietary dialect. > Add back support for custom Spark SQL dialects > -- > > Key: SPARK-18499 > URL: https://issues.apache.org/jira/browse/SPARK-18499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Ash > > Point 5 from the parent task: > {quote} > 5. I want to be able to use my own customized SQL constructs. An example of > this would supporting my own dialect, or be able to add constructs to the > current SQL language. I should not have to implement a complete parse, and > should be able to delegate to an underlying parser. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18499) Add back support for custom Spark SQL dialects
Andrew Ash created SPARK-18499: -- Summary: Add back support for custom Spark SQL dialects Key: SPARK-18499 URL: https://issues.apache.org/jira/browse/SPARK-18499 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Ash Point 5 from the parent task: {quote} 5. I want to be able to use my own customized SQL constructs. An example of this would supporting my own dialect, or be able to add constructs to the current SQL language. I should not have to implement a complete parse, and should be able to delegate to an underlying parser. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675699#comment-15675699 ] German Eduardo Melo commented on SPARK-9478: [~sethah] I was wondering if you are working on this request... the current PR for the improvement is https://github.com/apache/spark/pull/13851, right? For my research I am looking forward to this feature; thanks a lot in advance for any update! > Add class weights to Random Forest > -- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17450) spark sql rownumber OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675639#comment-15675639 ] cen yuhai commented on SPARK-17450: --- I will upgrade to 2.x, please close this issue > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > 
_we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project [passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
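Editor's note: the plan above routes every row through a TungstenExchange SinglePartition before the window sort, which is why a single task holds all records. One possible direction, sketched below, is to bucket the window with a PARTITION BY clause. This is an illustrative workaround only: it changes the semantics (ranks become per bucket rather than global), and the bucket count of 200 is an arbitrary assumption.

{code}
// Illustrative sketch only (Spark 1.6 HiveContext assumed, matching the report).
// Ranks are computed per pmod bucket, not globally, so the window sort stays
// distributed instead of running in a single task.
val bucketed = sqlContext.sql("""
  SELECT passenger_id, total_order,
         row_number() OVER (PARTITION BY pmod(passenger_id, 200)
                            ORDER BY total_order DESC) AS bucket_rank
  FROM (SELECT passenger_id, 1 AS total_order FROM table GROUP BY passenger_id) dd1
""")
{code}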
[jira] [Resolved] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18462. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.3, 2.1.0 > > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675517#comment-15675517 ] Reynold Xin commented on SPARK-18352: - Actually just talked to [~marmbrus] and now I understand more how JSON reader works. I'd say we always turn the top level array into multiple records, and then have only one option: wholeFile. This same option can be used in json and text. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
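Editor's note: for illustration, the user-facing shape of such an option might look like the sketch below. The "wholeFile" option name is the one floated in the comment above and did not exist in Spark at the time, so treat this purely as a hypothetical.

{code}
// Hypothetical usage sketch: the "wholeFile" option is the proposal discussed
// above, not an existing Spark API at the time of this thread.
val df = spark.read
  .option("wholeFile", "true")          // parse each file as a single JSON document
  .json("/data/pretty-printed/*.json")

// A file whose top-level value is an array would then yield one row per element,
// while a top-level object would yield a single row.
df.printSchema()
{code}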
[jira] [Commented] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675507#comment-15675507 ] Apache Spark commented on SPARK-16803: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15926 > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-+ > |key|value| > +---+-+ > | 1| abc| > | 1| abc| > +---+-+ > {noformat} > In Spark 1.6, it works, but Spark 2.0 does not work. The error message from > Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back it will break the > semantic of {{saveAsTable}} (this method uses by-name resolution instead of > using by-position resolution used by {{insertInto}}). > Instead, users should use {{insertInto}} API. We should correct the error > messages. Users can understand how to bypass it before we support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
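Editor's note: since the description points users at insertInto, a minimal sketch of that path for the example above follows. Note that insertInto resolves columns by position rather than by name, so the select list must match the target table's column order.

{code}
// Workaround sketch using the existing insertInto API instead of saveAsTable.
// Columns are matched by position, so keep the select list in table order.
val df = sql("select key, value as value from sample.sample")
df.write.mode("append").insertInto("sample.sample")
sql("select * from sample.sample").show()
{code}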
[jira] [Updated] (SPARK-18487) Add task completion listener to HashAggregate to avoid memory leak
[ https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-18487: Summary: Add task completion listener to HashAggregate to avoid memory leak (was: Consume all elements for Dataset.show/take to avoid memory leak) > Add task completion listener to HashAggregate to avoid memory leak > -- > > Key: SPARK-18487 > URL: https://issues.apache.org/jira/browse/SPARK-18487 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Methods such as Dataset.show and take use Limit (CollectLimitExec), which > leverages SparkPlan.executeTake to efficiently collect the required number of > elements back to the driver. > However, under whole-stage codegen, we usually release resources only after all > elements are consumed (e.g., HashAggregate). In this case, we will not > release the resources, causing a memory leak with Dataset.show, for example. > We can add a task completion listener to HashAggregate to avoid the memory leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
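Editor's note: the kind of hook the new summary refers to is sketched below. TaskContext.addTaskCompletionListener is a real Spark API, but the releaseResources placeholder only stands in for whatever buffers the aggregation operator holds and is not actual Spark code.

{code}
import org.apache.spark.TaskContext

// Sketch of the proposed hook. `releaseResources` is a placeholder for freeing
// the operator's hash map / buffers, not a real Spark method.
def releaseResources(): Unit = { /* free operator-owned memory here */ }

val taskContext = TaskContext.get()   // null on the driver, non-null inside a task
if (taskContext != null) {
  taskContext.addTaskCompletionListener { _ =>
    // Runs when the task completes, even if a downstream limit (Dataset.show/take)
    // stops consuming the iterator before all rows are produced.
    releaseResources()
  }
}
{code}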
[jira] [Updated] (SPARK-18487) Consume all elements for Dataset.show/take to avoid memory leak
[ https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-18487: Description: Methods such as Dataset.show and take use Limit (CollectLimitExec), which leverages SparkPlan.executeTake to efficiently collect the required number of elements back to the driver. However, under whole-stage codegen, we usually release resources only after all elements are consumed (e.g., HashAggregate). In this case, we will not release the resources, causing a memory leak with Dataset.show, for example. We can add a task completion listener to HashAggregate to avoid the memory leak. was: Methods such as Dataset.show and take use Limit (CollectLimitExec), which leverages SparkPlan.executeTake to efficiently collect the required number of elements back to the driver. However, under whole-stage codegen, we usually release resources only after all elements are consumed (e.g., HashAggregate). In this case, we will not release the resources, causing a memory leak with Dataset.show, for example. We should consume all elements in the iterator to avoid the memory leak. > Consume all elements for Dataset.show/take to avoid memory leak > --- > > Key: SPARK-18487 > URL: https://issues.apache.org/jira/browse/SPARK-18487 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Methods such as Dataset.show and take use Limit (CollectLimitExec), which > leverages SparkPlan.executeTake to efficiently collect the required number of > elements back to the driver. > However, under whole-stage codegen, we usually release resources only after all > elements are consumed (e.g., HashAggregate). In this case, we will not > release the resources, causing a memory leak with Dataset.show, for example. > We can add a task completion listener to HashAggregate to avoid the memory leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18436: Assignee: Apache Spark > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan >Assignee: Apache Spark > Labels: jdbc, sql > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18436: Assignee: (was: Apache Spark) > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan > Labels: jdbc, sql > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675473#comment-15675473 ] Apache Spark commented on SPARK-18436: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/15925 > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan > Labels: jdbc, sql > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
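Editor's note: until the JDBC pushdown handles the empty case, one application-side workaround is to short-circuit the empty list before it reaches isin, as sketched below. SOURCE_CONNECTION, SQL_DRIVER, TABLE and filter are the names from the report and are assumed to be defined as described there.

{code}
import org.apache.spark.sql.functions.lit
import spark.implicits._

// Workaround sketch: with an empty `filter` array the result is empty by
// definition, so use a constant false predicate instead of generating "IN ()".
val baseDF = spark.read.format("jdbc")
  .options(Map("url" -> SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> TABLE))
  .load()

val sortDF =
  if (filter.isEmpty) baseDF.filter(lit(false))
  else baseDF.filter($"cl_ult".isin(filter: _*))
{code}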
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675466#comment-15675466 ] Hyukjin Kwon commented on SPARK-18352: -- Ah, you meant producing each row while parsing the whole text in iteration. I see. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675460#comment-15675460 ] yuhao yang commented on SPARK-18356: I assume the performance improvement depends on the computation costs of uncached RDD lineage. Do you plan to send a PR for the improvement? > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Priority: Minor > Labels: easyfix > > Hello, > I'm newbie in spark, but I think that I found a small problem that can affect > spark Kmeans performances. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark Kmeans with Dataframes to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning : > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched in spark source code to find the source of this problem, then I > realized there is two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my dataframe is cached, the fit method transform my dataframe into an > internally rdd which is not cached. > Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) responsible for converting the dataframe into > rdd then call Kmeans Algorithm > ->The second class (mllib package) implements Kmeans Algorithm, and here > spark verify if the rdd is cached, if not a warning will be generated. > So, the solution of this problem is to cache the rdd before running Kmeans > Algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All what we need is to add two lines: > Cache rdd just after dataframe transformation, then uncached it after > training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
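Editor's note: a minimal sketch of the change described above (persist the derived RDD before handing it to the mllib trainer, unpersist afterwards) follows. `instances` and `mllibKMeans` are illustrative names standing in for the RDD of feature vectors and the underlying mllib KMeans instance inside ml.KMeans.fit; the actual source may differ.

{code}
import org.apache.spark.storage.StorageLevel

// Sketch of the proposed fix, assuming `instances` is the RDD of vectors that
// ml.KMeans.fit builds from the DataFrame and `mllibKMeans` is the underlying
// mllib trainer; only persist if the caller has not already cached the RDD.
val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

val parentModel = mllibKMeans.run(instances)   // training no longer warns about uncached input

if (handlePersistence) instances.unpersist(blocking = false)
{code}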
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675446#comment-15675446 ] Reynold Xin commented on SPARK-18352: - No that's not sufficient. It doesn't do streaming. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675441#comment-15675441 ] Hyukjin Kwon edited comment on SPARK-18352 at 11/18/16 1:53 AM: Hi [~rxin], I think it seems this can be simply done after https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally (and of course resembling https://github.com/apache/spark/pull/14151). Would this be what you think in your mind already? If so, I can work on this if anyone is not supposed to do this. (I am fine if anyone is assigned to this internally). was (Author: hyukjin.kwon): Hi [~rxin], I think it seems this can be simply done after https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally. Would this be what you think in your mind already? If so, I can work on this if anyone is not supposed to do this. (I am fine if anyone is assigned to this internally). > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675441#comment-15675441 ] Hyukjin Kwon commented on SPARK-18352: -- Hi [~rxin], I think this can be done fairly simply once https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally. Is this what you already had in mind? If so, I can work on this if no one else is supposed to do it. (I am fine if someone is already assigned to this internally.) > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675437#comment-15675437 ] Reynold Xin commented on SPARK-18352: - I guess maybe it should be a user-configurable option? Otherwise Spark on its own doesn't have enough information to disambiguate. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
[ https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675434#comment-15675434 ] Narayanan Nachiappan commented on SPARK-13767: -- [~rahul.bhati...@gmail.com] Were you able to figure out the root cause for that issue ? > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to > the Java server > > > Key: SPARK-13767 > URL: https://issues.apache.org/jira/browse/SPARK-13767 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Poonam Agrawal > > I am trying to create spark context object with the following commands on > pyspark: > from pyspark import SparkContext, SparkConf > conf = > SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host', > 'cassandra-machine-ip').set('spark.storage.memoryFraction', > '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', > 500).set('spark.serializer', > 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', > 'FAIR').set('spark.mesos.coarse', 'true') > sc = SparkContext(conf=conf) > but I am getting the following error: > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in > __init__ > self._jconf = _jvm.SparkConf(loadDefaults) > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 766, in __getattr__ > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 362, in send_command > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 318, in _get_connection > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 325, in _create_connection > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 432, in start > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to > the Java server > I am getting the same error executing the command : conf = > SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675421#comment-15675421 ] Nathan Howell commented on SPARK-18352: --- Do you have any ideas how to support this? {{DataFrameReader.schema}} currently takes a {{StructType}} and the existing row level json reader flattens arrays out to support this restriction. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18498: Assignee: (was: Apache Spark) > Clean up HDFSMetadataLog API for better testing > --- > > Key: SPARK-18498 > URL: https://issues.apache.org/jira/browse/SPARK-18498 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tyson Condie >Priority: Minor > Labels: test > Original Estimate: 168h > Remaining Estimate: 168h > > HDFSMetadataLog current conflates metadata log serialization and file writes. > The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675412#comment-15675412 ] Apache Spark commented on SPARK-18498: -- User 'tcondie' has created a pull request for this issue: https://github.com/apache/spark/pull/15924 > Clean up HDFSMetadataLog API for better testing > --- > > Key: SPARK-18498 > URL: https://issues.apache.org/jira/browse/SPARK-18498 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tyson Condie >Priority: Minor > Labels: test > Original Estimate: 168h > Remaining Estimate: 168h > > HDFSMetadataLog current conflates metadata log serialization and file writes. > The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
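Editor's note: an illustrative shape of the separation the description asks for is sketched below; serialization gets its own seam so it can be unit-tested against in-memory streams, independent of the HDFS write path. The trait and method names here are placeholders, not the real HDFSMetadataLog API.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

// Placeholder names only -- not the actual HDFSMetadataLog API. The point is that
// serialize/deserialize become testable without writing any files.
trait MetadataSerde[T] {
  def serialize(metadata: T, out: OutputStream): Unit
  def deserialize(in: InputStream): T
}

// In a test, round-trip a batch's metadata entirely in memory.
def roundTrip[T](serde: MetadataSerde[T], metadata: T): T = {
  val out = new ByteArrayOutputStream()
  serde.serialize(metadata, out)
  serde.deserialize(new ByteArrayInputStream(out.toByteArray))
}
{code}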
[jira] [Assigned] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18498: Assignee: Apache Spark > Clean up HDFSMetadataLog API for better testing > --- > > Key: SPARK-18498 > URL: https://issues.apache.org/jira/browse/SPARK-18498 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tyson Condie >Assignee: Apache Spark >Priority: Minor > Labels: test > Original Estimate: 168h > Remaining Estimate: 168h > > HDFSMetadataLog current conflates metadata log serialization and file writes. > The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675405#comment-15675405 ] Reynold Xin commented on SPARK-18352: - Are these actually record delimiters? If the top level structure is an array, would we want to parse a single file as multiple records? > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18360) default table path of tables in default database should depend on the location of default database
[ https://issues.apache.org/jira/browse/SPARK-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18360. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15812 [https://github.com/apache/spark/pull/15812] > default table path of tables in default database should depend on the > location of default database > -- > > Key: SPARK-18360 > URL: https://issues.apache.org/jira/browse/SPARK-18360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18360) default table path of tables in default database should depend on the location of default database
[ https://issues.apache.org/jira/browse/SPARK-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18360: - Labels: release_notes releasenotes (was: ) > default table path of tables in default database should depend on the > location of default database > -- > > Key: SPARK-18360 > URL: https://issues.apache.org/jira/browse/SPARK-18360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18483) spark on yarn always connect to yarn resourcemanager at 0.0.0.0:8032
[ https://issues.apache.org/jira/browse/SPARK-18483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675388#comment-15675388 ] inred commented on SPARK-18483: --- it failed even when i set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop, i submit it from windows development node to remote Linux yarn cluster > spark on yarn always connect to yarn resourcemanager at 0.0.0.0:8032 > -- > > Key: SPARK-18483 > URL: https://issues.apache.org/jira/browse/SPARK-18483 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: java8 > SBT0.13 > scala2.11.8 > spark-2.0.1-bin-hadoop2.6 >Reporter: inred > > I have installed the yarn resource manager at 192.168.13.159:8032 and have > set YARN_CONF_DIR environment var and have the yarn-site.xml > configured as the following, but it always connects to 0.0.0.0:8032 instead > of > 192.168.13.159:8032 > set environment > E:\app>set yarn > YARN_CONF_DIR=D:\Documents\download\hadoop > E:\app>set had > HADOOP_HOME=D:\Documents\download\hadoop > E:\app>cat D:\Documents\download\hadoop\yarn-site.xml > > yarn.resourcemanager.address > 192.168.13.159:8032 > > "C:\Program Files\Java\jdk1.8.0_92\bin\java" -Didea.launcher.port=7532 > "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA > Community Edition 2016.2.5\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program > Files\Java\jdk1.8.0_92\jre\lib\charsets.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\deploy.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\access-bridge-64.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\cldrdata.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\dnsns.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\jaccess.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\jfxrt.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\localedata.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\nashorn.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunec.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunjce_provider.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunmscapi.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunpkcs11.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\zipfs.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\javaws.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jce.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jfr.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jfxswt.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jsse.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\management-agent.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\plugin.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\resources.jar;C:\Program > 
Files\Java\jdk1.8.0_92\jre\lib\rt.jar;E:\app\adam-ano\target\scala-2.11\classes;C:\Users\feiwu\.ivy2\cache\org.scala-lang\scala-library\jars\scala-library-2.11.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\xz-1.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jta-1.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jpam-1.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\guice-3.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\ivy-2.4.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\lz4-1.3.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\oro-2.0.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\ST4-4.0.4.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\avro-1.7.7.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\core-1.1.2.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\gson-2.2.4.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\mail-1.4.7.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\mx4j-3.0.2.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\snappy-0.2.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\antlr-2.7.7.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\opencsv-2.3.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\py4j-0.10.3.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\xmlenc-0.52.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\base64-2.3.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\guava-14.0.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\janino-2.7.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jets3t-0.9.3.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jetty-6.1.26.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jline-2.12.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jsr305-1.3.9.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\log4j-1.2.17.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\minlog-1.3.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\pyrolite-4.9.j
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675386#comment-15675386 ] Nathan Howell commented on SPARK-18352: --- Any opinions on configuring this with an option instead of creating a new data source? It looks fairly straightforward to support this as an option. E.g.: {code} // parse one json value per line // this would be the default behavior, for backwards compatibility spark.read.option("recordDelimiter", "line").json(???) // parse one json value per file spark.read.option("recordDelimiter", "file").json(???) {code} The refactoring work would be the same in either case, but it would require less plumbing for Python/Java/etc to enable this with an option. As an aside... it also is straightforward to extend this to support {{Text}} and {{UTF8String}} values directly, avoiding a string conversion of the entire column prior to parsing. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
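For context on the discussion above, a minimal sketch of how whole-file JSON parsing can be approximated with the existing 2.x APIs, assuming each input file holds exactly one JSON document; the path and session setup are illustrative and are not part of the proposal:
{code}
// Sketch only: approximate "one JSON document per file" parsing by reading each
// file as a single string and handing the strings to the JSON reader.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("whole-file-json-sketch").getOrCreate()

// wholeTextFiles yields one (path, content) pair per file, so a record is
// never split on newline boundaries.
val wholeFiles = spark.sparkContext.wholeTextFiles("/data/json-docs/*.json")

// spark.read.json(RDD[String]) treats every element as one complete JSON document.
val df = spark.read.json(wholeFiles.values)
df.printSchema()
{code}
This keeps every document in memory as one string, which is exactly the cost a built-in whole-file mode (or a {{recordDelimiter}} option) could avoid by streaming through the files.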
[jira] [Created] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
Tyson Condie created SPARK-18498: Summary: Clean up HDFSMetadataLog API for better testing Key: SPARK-18498 URL: https://issues.apache.org/jira/browse/SPARK-18498 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 2.0.1 Reporter: Tyson Condie Priority: Minor HDFSMetadataLog currently conflates metadata log serialization and file writes. The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
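To make the intended separation concrete, a hedged sketch of what splitting serialization from the file write could look like; the trait and class names are illustrative assumptions, not the actual HDFSMetadataLog API:
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

// Serialization becomes its own seam, so tests can exercise it entirely in
// memory without touching HDFS or a checkpoint directory.
trait MetadataSerialization[T] {
  def serialize(metadata: T, out: OutputStream): Unit
  def deserialize(in: InputStream): T
}

class StringMetadataSerialization extends MetadataSerialization[String] {
  override def serialize(metadata: String, out: OutputStream): Unit =
    out.write(metadata.getBytes("UTF-8"))
  override def deserialize(in: InputStream): String =
    scala.io.Source.fromInputStream(in, "UTF-8").mkString
}

// A unit test can round-trip metadata without any file system interaction.
val serde = new StringMetadataSerialization
val buffer = new ByteArrayOutputStream()
serde.serialize("batch-0 offsets", buffer)
assert(serde.deserialize(new ByteArrayInputStream(buffer.toByteArray)) == "batch-0 offsets")
{code}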
[jira] [Updated] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"
[ https://issues.apache.org/jira/browse/SPARK-18497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18497: - Target Version/s: 2.1.0 > ForeachSink fails with "assertion failed: No plan for EventTimeWatermark" > - > > Key: SPARK-18497 > URL: https://issues.apache.org/jira/browse/SPARK-18497 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Aaron Davidson > > I have a pretty standard stream. I call ".writeStream.foreach(...).start()" > and get > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39688: timestamp, interval 1 days > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2751) > at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2290) > at > org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:70) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:449) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcZ$sp(StreamExecution.scala:227) >
[jira] [Updated] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"
[ https://issues.apache.org/jira/browse/SPARK-18497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18497: - Priority: Critical (was: Major) > ForeachSink fails with "assertion failed: No plan for EventTimeWatermark" > - > > Key: SPARK-18497 > URL: https://issues.apache.org/jira/browse/SPARK-18497 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Aaron Davidson >Priority: Critical > > I have a pretty standard stream. I call ".writeStream.foreach(...).start()" > and get > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39688: timestamp, interval 1 days > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2751) > at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2290) > at > org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:70) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:449) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.
[jira] [Created] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"
Aaron Davidson created SPARK-18497: -- Summary: ForeachSink fails with "assertion failed: No plan for EventTimeWatermark" Key: SPARK-18497 URL: https://issues.apache.org/jira/browse/SPARK-18497 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Aaron Davidson I have a pretty standard stream. I call ".writeStream.foreach(...).start()" and get {code} java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark timestamp#39688: timestamp, interval 1 days at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107) at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:232) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2751) at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2290) at org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:70) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:449) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcZ$sp(StreamExecution.scala:227) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:215) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sq
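A hedged reconstruction of the kind of query the report describes, pairing {{withWatermark}} with a {{ForeachWriter}} sink; the socket source, generated event-time column, and one-day watermark are assumptions for illustration, not taken from the reporter's job:
{code}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder().appName("foreach-watermark-sketch").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Attach an event-time column and a watermark, then write through foreach(...).
val withEventTime = lines
  .withColumn("timestamp", current_timestamp())
  .withWatermark("timestamp", "1 day")

val query = withEventTime.writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = true
    override def process(value: Row): Unit = println(value)
    override def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
{code}
On an affected build, a watermark in the logical plan combined with the foreach sink is the shape that surfaces the "No plan for EventTimeWatermark" assertion shown above.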
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Description: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current Nearest Neighbor implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf was: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current MultiProbe implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) How we should change the current Nearest Neighbor implementation to make > it align with the MultiProbe NN Search from the origin paper: > http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
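For the first agreed-upon change, a minimal sketch of how {{approxQuantile}} can pick a {{hashDistance}} cutoff without sorting the whole dataset; the column name, quantile, and error tolerance are illustrative assumptions, not the LSH implementation itself:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lsh-threshold-sketch").getOrCreate()
import spark.implicits._

// Pretend hash distances were already computed for every candidate row.
val distances = Seq(0.0, 0.1, 0.3, 0.4, 0.7, 0.9, 1.2).toDF("hashDistance")

// Approximate the 30th percentile with 1% relative error instead of doing a
// full sort to find a cutoff.
val Array(threshold) =
  distances.stat.approxQuantile("hashDistance", Array(0.3), 0.01)

// Keep only candidates at or below the estimated threshold.
val candidates = distances.filter($"hashDistance" <= threshold)
candidates.show()
{code}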
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Description: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) What are the issues and how we should change the current Nearest Neighbor implementation was: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current Nearest Neighbor implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Summary: Changes to fix Nearest Neighbor Search for LSH (was: Changes to fix Multi-Probe Nearest Neighbor Search for LSH) > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) How we should change the current MultiProbe implementation to make it > align with the MultiProbe NN Search from the origin paper: > http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675345#comment-15675345 ] Joseph K. Bradley commented on SPARK-18321: --- I just noticed there are also problems with having 2 copies of many methods in the Java doc, one static and one not. This too is a long-standing problem. > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675306#comment-15675306 ] Apache Spark commented on SPARK-4105: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/15923 > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675287#comment-15675287 ] Joseph K. Bradley commented on SPARK-18321: --- I'm guessing it's because it's a private class within a file with a different name containing other classes. I think it's OK to leave and hopefully fix in the unidoc gen later on. > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-4105: - Assignee: Davies Liu (was: Josh Rosen) > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occurrence of a similar error: > {co
[jira] [Commented] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675234#comment-15675234 ] Apache Spark commented on SPARK-18462: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15922 > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18462: Assignee: Josh Rosen (was: Apache Spark) > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18462: Assignee: Apache Spark (was: Josh Rosen) > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-18462: --- Target Version/s: 2.0.3, 2.1.0 > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18496) java.lang.AssertionError: assertion failed
[ https://issues.apache.org/jira/browse/SPARK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harish updated SPARK-18496: --- Affects Version/s: 2.0.2 > java.lang.AssertionError: assertion failed > -- > > Key: SPARK-18496 > URL: https://issues.apache.org/jira/browse/SPARK-18496 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 > Environment: 2.0.2 snapshot >Reporter: Harish > > I am getting this error when i store the estimates from Julia output to a DF > and then i do df.cache() > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 177.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 177.0 (TID 9722, 10.63.136.108): java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:156) > at > org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) > at > org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) > at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916) > at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441) > at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) > at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(Gatewa
[jira] [Created] (SPARK-18496) java.lang.AssertionError: assertion failed
Harish created SPARK-18496: -- Summary: java.lang.AssertionError: assertion failed Key: SPARK-18496 URL: https://issues.apache.org/jira/browse/SPARK-18496 Project: Spark Issue Type: Bug Environment: 2.0.2 snapshot Reporter: Harish I am getting this error when i store the estimates from Julia output to a DF and then i do df.cache() py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 177.0 failed 4 times, most recent failure: Lost task 0.3 in stage 177.0 (TID 9722, 10.63.136.108): java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) at scala.Option.foreach(Option.scala:257) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916) at 
org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441) at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) at
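The reported pattern amounts to building a DataFrame from externally computed estimates, caching it, and running a job over it. A minimal sketch of that pattern (in Scala, with hypothetical column names; the original report went through PySpark, and this is not a deterministic reproduction of the assertion failure):

import org.apache.spark.sql.SparkSession

object CacheEstimatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("spark-18496-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the estimates computed outside Spark (e.g. in Julia).
    val estimates = Seq((1L, 0.42), (2L, 0.17), (3L, 0.99)).toDF("id", "estimate")

    estimates.cache()   // mark the DataFrame for caching
    estimates.count()   // materialize it; the reported failure surfaced while releasing block locks
    spark.stop()
  }
}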
[jira] [Commented] (SPARK-18495) Web UI should document meaning of green dot in DAG visualization
[ https://issues.apache.org/jira/browse/SPARK-18495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674980#comment-15674980 ] Nicholas Chammas commented on SPARK-18495: -- cc [~andrewor14] > Web UI should document meaning of green dot in DAG visualization > > > Key: SPARK-18495 > URL: https://issues.apache.org/jira/browse/SPARK-18495 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Nicholas Chammas >Priority: Trivial > > A green dot in the DAG visualization apparently means that the referenced RDD > is cached. This is not documented anywhere except in [this blog > post|https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html]. > It would be good if the Web UI itself documented this somehow (perhaps in the > tooltip?) so that the user can naturally learn what it means while using the > Web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18495) Web UI should document meaning of green dot in DAG visualization
Nicholas Chammas created SPARK-18495: Summary: Web UI should document meaning of green dot in DAG visualization Key: SPARK-18495 URL: https://issues.apache.org/jira/browse/SPARK-18495 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.0.2 Reporter: Nicholas Chammas Priority: Trivial A green dot in the DAG visualization apparently means that the referenced RDD is cached. This is not documented anywhere except in [this blog post|https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html]. It would be good if the Web UI itself documented this somehow (perhaps in the tooltip?) so that the user can naturally learn what it means while using the Web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-18493: Component/s: PySpark > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Burak Yavuz > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
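For reference, the Scala Dataset surface that the Python API would mirror. A small sketch (the rate source and the checkpoint directory are illustrative choices, not taken from the ticket):

import org.apache.spark.sql.SparkSession

object WatermarkCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("spark-18493-sketch").getOrCreate()
    import spark.implicits._

    // Streaming side: withWatermark bounds how late event-time data may arrive.
    val events = spark.readStream.format("rate").load()            // exposes a 'timestamp' column
    val bounded = events.withWatermark("timestamp", "10 minutes")
    println(bounded.isStreaming)

    // Batch side: checkpoint() eagerly materializes the Dataset and truncates its lineage.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path
    val checkpointed = Seq(1, 2, 3).toDF("v").checkpoint()
    println(checkpointed.count())

    spark.stop()
  }
}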
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674801#comment-15674801 ] Reynold Xin commented on SPARK-18252: - Those two methods are pretty inefficient. When we use this in SQL for joins, we'd want a hyper efficient probing without caring too much about the false positive rate (i.e. size will be small), and being able to have all the internal details exposed in a simple way would be critical there. It might be a good idea to create a version of the bloom filter that can be compressed, but I definitely don't want that to be the only implementation. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
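To make the probing point concrete, here is a rough sketch (not Spark's actual implementation) of membership testing against a flat long[] bit array with double hashing; a tight loop like this is what lends itself to a vectorized variant, whereas probing through a compressed bitmap does not:

object BloomProbeSketch {
  // words: the filter's bit vector packed into 64-bit longs.
  // h1, h2: two hashes of the probed item; numHashes: the k hash functions.
  def mightContain(words: Array[Long], numBits: Long, h1: Long, h2: Long, numHashes: Int): Boolean = {
    var i = 0
    while (i < numHashes) {
      var combined = h1 + i * h2               // double hashing derives the i-th probe
      if (combined < 0) combined = ~combined   // keep the index non-negative
      val bit = combined % numBits
      val word = words((bit >>> 6).toInt)      // 64 bits per long word
      if ((word & (1L << (bit & 63))) == 0L) return false
      i += 1
    }
    true
  }
}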
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674821#comment-15674821 ] Aleksey Ponkin commented on SPARK-18252: I do not have any benchmarks, but I believe that iteration over true positions in RB must be pretty fast, since this is essentially run-length encoding, where only random access operations like get and set are pretty painful. We definitely need some benchmarks for this. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674818#comment-15674818 ] Reynold Xin commented on SPARK-18252: - Regarding this - can you find some performance data on how the build time compares between roaring one and the normal bitset? I'm also thinking maybe we can have the default build returning a roaring one, and then have a way to return a simple bloom filter based on normal bitsets and SQL can use that one. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
Burak Yavuz created SPARK-18493: --- Summary: Add withWatermark and checkpoint to python dataframe Key: SPARK-18493 URL: https://issues.apache.org/jira/browse/SPARK-18493 Project: Spark Issue Type: Improvement Components: SQL Reporter: Burak Yavuz These two methods were added to Scala Datasets, but are not available in Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674806#comment-15674806 ] Apache Spark commented on SPARK-18493: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15921 > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Burak Yavuz > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18493: Assignee: Apache Spark > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Burak Yavuz >Assignee: Apache Spark > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18493: Assignee: (was: Apache Spark) > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Burak Yavuz > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674791#comment-15674791 ] Aleksey Ponkin commented on SPARK-18252: Well, I do not know anything about vectorized probing, but you can freely iterate over all true positions in RoaringBitmap with this method [http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#iterator--], and you can get an int[] of true positions in RB with this method [http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#toArray--]. Or have I missed something? > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
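A small sketch of those two calls (org.roaringbitmap.RoaringBitmap; the values are arbitrary):

import org.roaringbitmap.RoaringBitmap

object RoaringIterationSketch {
  def main(args: Array[String]): Unit = {
    val rb = new RoaringBitmap()
    rb.add(3); rb.add(64); rb.add(100000)

    // iterator(): walk the set (true) positions in ascending order.
    val it = rb.iterator()
    while (it.hasNext) println(it.next())

    // toArray(): materialize the set positions as an int[] in one call.
    val positions: Array[Int] = rb.toArray
    println(positions.mkString(", "))
  }
}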
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674767#comment-15674767 ] Reynold Xin commented on SPARK-18252: - For 3, the sketch package has no external dependency, and was created explicitly this way so bloom filter built in Spark can be used in other applications without having to worry about dependency conflicts. For 4, it is much easier to just create a vectorized version of the probing code when all we are dealing with is a simple for loop. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674628#comment-15674628 ] Ofir Manor commented on SPARK-18475: I was just wondering if it actually works, but it seems you found a way to hack it (I thought you would need a different consumer group per worker to avoid coordination by the broker, but it didn't seem like it). If it does provide a big perf boost in some cases, and it is not enabled by default, I personally don't have any objections. [~c...@koeninger.org] - I didn't understand your objection. An RDD / dataset does not have any inherent order guarantees (same as a SQL result set), and the Kafka metadata per message (including topic, partition, offset) is exposed if someone really cares. If you guys have a smart hack that allows you to divide a specific partition into ranges and have different workers read different ranges of a partition in parallel, and if it does provide a significant perf boost, why not have it as an option? I don't think it should break correctness, as the boundaries are decided anyway by the driver. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
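A sketch of the driver-side splitting being discussed; OffsetSlice and splitRange are hypothetical names rather than the actual source, but they show why the driver fixing the boundaries keeps correctness intact:

object OffsetSplitSketch {
  // Hypothetical: split one TopicPartition's [from, until) offset range into n contiguous slices.
  final case class OffsetSlice(topic: String, partition: Int, from: Long, until: Long)

  def splitRange(topic: String, partition: Int, from: Long, until: Long, n: Int): Seq[OffsetSlice] =
    (0 until n).map { i =>
      val start = from + (until - from) * i / n
      val end   = from + (until - from) * (i + 1) / n
      OffsetSlice(topic, partition, start, end)
    }.filter(s => s.until > s.from)

  def main(args: Array[String]): Unit = {
    // Four tasks could then read [1000,2000), [2000,3000), [3000,4000), [4000,5000) in parallel.
    splitRange("events", 0, 1000L, 5000L, 4).foreach(println)
  }
}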
[jira] [Comment Edited] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674599#comment-15674599 ] Kazuaki Ishizaki edited comment on SPARK-18492 at 11/17/16 7:31 PM: I agree with your point. Can you post a small program that can reproduce this issue? was (Author: kiszk): Can you post a small program that can reproduce this issue? > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData projec
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674599#comment-15674599 ] Kazuaki Ishizaki commented on SPARK-18492: -- Can you post a small program that can reproduce this issue? > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null; > /* 12269 */ if (!project_isNull252) { > /* 12270 */ project_value252 = project_result249; > /* 12271 */ } > /* 12272 */
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674579#comment-15674579 ] Sean Owen commented on SPARK-9487: -- Agree, it seems like it should not be sensitive to ordering within each batch. This could convert the List<...> to a Set<...> so that they are compared without regard to order. If that's the nature of the difference, then yes it's the test that should be fixed as part of this change. > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558 ] Aleksey Ponkin edited comment on SPARK-18252 at 11/17/16 7:23 PM: -- Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding.Current implementation is not very useful, because you can not serialize bloom filter (with 10 millions items for example) - kryo has a limit of 2G object size. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. was (Author: ponkin): Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding.Current implementation is very useful because you can not serialize bloom filter (with 10 millions items for example) - kryo has a limit of 2G object size. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558 ] Aleksey Ponkin edited comment on SPARK-18252 at 11/17/16 7:22 PM: -- Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding.Current implementation is very useful because you can not serialize bloom filter (with 10 millions items for example) - kryo has a limit of 2G object size. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. was (Author: ponkin): Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558 ] Aleksey Ponkin commented on SPARK-18252: Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Ponkin updated SPARK-18252: --- Comment: was deleted (was: Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. ) > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674559#comment-15674559 ] Aleksey Ponkin commented on SPARK-18252: Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674543#comment-15674543 ] Saikat Kanjilal commented on SPARK-9487: Sean, I took a look at the code and here it is:
List<List<String>> inputData = Arrays.asList(
  Arrays.asList("hello", "world"),
  Arrays.asList("hello", "moon"),
  Arrays.asList("hello"));
List<List<Tuple2<String, Long>>> expected = Arrays.asList(
  Arrays.asList(new Tuple2<>("hello", 1L), new Tuple2<>("world", 1L)),
  Arrays.asList(new Tuple2<>("hello", 1L), new Tuple2<>("moon", 1L)),
  Arrays.asList(new Tuple2<>("hello", 1L)));
JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
JavaPairDStream<String, Long> counted = stream.countByValue();
JavaTestUtils.attachTestOutputStream(counted);
List<List<Tuple2<String, Long>>> result = JavaTestUtils.runStreams(ssc, 3, 3);
Assert.assertEquals(expected, result);
As you can see, the expected value assumes that the contents of the stream get counted accurately for every word. The output generated by the flaky run just has (hello,1) and (moon,1) reversed, which I don't think matters; unless the goal of the test is to identify words in the order they enter the stream, both the expected and the actual answers are correct. So, net net, the test is flaky. Should I refactor the test to actually look at the word counts and not the order? Thoughts on next steps? > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674535#comment-15674535 ] Burak Yavuz commented on SPARK-18475: - [~c...@koeninger.org] I don't see where you may need strict ordering in SQL land. The way I see it, I should be able to tell Kafka: Give me point A to B from Stream X. I thought this was the general use case of Kafka. Maybe you want to use it as a FIFO buffer, maybe you want to use it as centralized storage. There are other systems out there that do this kind of processing (e.g. secor -> https://github.com/pinterest/secor/), therefore I don't think that it's pretty appropriate for the general use case. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674532#comment-15674532 ] Reynold Xin commented on SPARK-18252: - I'm not sure if it is worth fixing this: 1. We already compress data before we send them across the network. 2. This is also not a backward-compatible change and would require different versioning. 3. This brings in an extra dependency for a package that has 0 external dependencies. 4. We very likely will implement vectorized probing for the bloom filter to be used in Spark SQL joins, and using roaring bitmap would make that a lot harder to do. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
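For concreteness, the up-front allocation under discussion can be observed with the existing sketch API; a small sketch (capacity and false-positive probability are arbitrary):

import java.io.ByteArrayOutputStream
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterSizeSketch {
  def main(args: Array[String]): Unit = {
    // BitArray space is sized up front from the expected capacity, so the serialized
    // form is roughly the same size whether the filter holds one item or millions.
    val bf = BloomFilter.create(10000000L, 0.03)   // expected items, false-positive probability
    bf.putLong(42L)

    val out = new ByteArrayOutputStream()
    bf.writeTo(out)
    println(s"serialized size: ${out.size()} bytes")
  }
}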
[jira] [Created] (SPARK-18492) GeneratedIterator grows beyond 64 KB
Norris Merritt created SPARK-18492: -- Summary: GeneratedIterator grows beyond 64 KB Key: SPARK-18492 URL: https://issues.apache.org/jira/browse/SPARK-18492 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Environment: CentOS release 6.7 (Final) Reporter: Norris Merritt spark-submit fails with ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "(I[Lscala/collection/Iterator;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB Error message is followed by a huge dump of generated source code. The generated code declares 1,454 field sequences like the following: /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF project_scalaUDF1; /* 037 */ private scala.Function1 project_catalystConverter1; /* 038 */ private scala.Function1 project_converter1; /* 039 */ private scala.Function1 project_converter2; /* 040 */ private scala.Function2 project_udf1; (many omitted lines) ... /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF project_scalaUDF1454; /* 6090 */ private scala.Function1 project_catalystConverter1454; /* 6091 */ private scala.Function1 project_converter1695; /* 6092 */ private scala.Function1 project_udf1454; It then proceeds to emit code for several methods (init, processNext) each of which has totally repetitive sequences of statements pertaining to each of the sequences of variables declared in the class. For example: /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { The reason that the 64KB JVM limit for code for a method is exceeded is because the code generator is using an incredibly naive strategy. It emits a sequence like the one shown below for each of the 1,454 groups of variables shown above, in /* 6132 */ this.project_udf = (scala.Function1)project_scalaUDF.userDefinedFunc(); /* 6133 */ this.project_scalaUDF1 = (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; /* 6134 */ this.project_catalystConverter1 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); /* 6135 */ this.project_converter1 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); /* 6136 */ this.project_converter2 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); It blows up after emitting 230 such sequences, while trying to emit the 231st: /* 7282 */ this.project_udf230 = (scala.Function2)project_scalaUDF230.userDefinedFunc(); /* 7283 */ this.project_scalaUDF231 = (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; /* 7284 */ this.project_catalystConverter231 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); many omitted lines ... 
Example of repetitive code sequences emitted for processNext method: /* 12253 */ boolean project_isNull247 = project_result244 == null; /* 12254 */ MapData project_value247 = null; /* 12255 */ if (!project_isNull247) { /* 12256 */ project_value247 = project_result244; /* 12257 */ } /* 12258 */ Object project_arg = sort_isNull5 ? null : project_converter489.apply(sort_value5); /* 12259 */ /* 12260 */ ArrayData project_result249 = null; /* 12261 */ try { /* 12262 */ project_result249 = (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); /* 12263 */ } catch (Exception e) { /* 12264 */ throw new org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); /* 12265 */ } /* 12266 */ /* 12267 */ boolean project_isNull252 = project_result249 == null; /* 12268 */ ArrayData project_value252 = null; /* 12269 */ if (!project_isNull252) { /* 12270 */ project_value252 = project_result249; /* 12271 */ } /* 12272 */ Object project_arg1 = project_isNull252 ? null : project_converter488.apply(project_value252); /* 12273 */ /* 12274 */ ArrayData project_result248 = null; /* 12275 */ try { /* 12276 */ project_result248 = (ArrayData)project_catalystConverter247.apply(project_udf247.apply(project_arg1)); /* 12277 */ } catch (Exception e) { /* 12278 */ throw new org.apache.sp
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674459#comment-15674459 ] Cody Koeninger commented on SPARK-18475: This has come up several times, and my answer is consistently the same - as Ofir said, the Kafka model is parallelism bounded by number of partitions. Breaking that model breaks user expectations, e.g. concerning ordering. It's fine for you if this helps your specific use case, but I think it is not appropriate for general use. I'd recommend people fix their skew and/or repartition at the producer level. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18317: -- Attachment: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html I checked the result from japi-compliance-checker. All binary incompatible changes reported are either private or package private. So we are good to go. > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18317. --- Resolution: Done > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674308#comment-15674308 ] holdenk commented on SPARK-2620: I don't think it's been resolved. Does your code need to be in the REPL, or can you compile it? Do you have a small repro case so we can see if it might be the same issue? If you're working in Jupyter, it's possible that the issue is fixed by using the new Scala kernel, which has a different REPL backend, but I haven't had a chance to investigate whether the same issue is present there. > case class cannot be used as key for reduce > --- > > Key: SPARK-2620 > URL: https://issues.apache.org/jira/browse/SPARK-2620 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0 > Environment: reproduced on spark-shell local[4] >Reporter: Gerard Maas >Assignee: Tobias Schlatter >Priority: Critical > Labels: case-class, core > > Using a case class as a key doesn't seem to work properly on Spark 1.0.0 > A minimal example: > case class P(name:String) > val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) > sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect > [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), > (P(bob),1), (P(abe),1), (P(charly),1)) > In contrast to the expected behavior, that should be equivalent to: > sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect > Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) > groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
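For reference, a compiled (non-REPL) counterpart of the repro in the description might look like the sketch below. Assuming the defect is specific to how the REPL wraps case class definitions, reduceByKey here should merge the two P("bob") records; this is an illustrative check, not a confirmed fix.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Compiled counterpart of the spark-shell repro above: the case class is defined
// at the top level of a source file rather than inside a REPL line object.
case class P(name: String)

object CaseClassKeyCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("case-class-key"))
    val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
    val counts = sc.parallelize(ps).map(x => (x, 1)).reduceByKey(_ + _).collect()
    counts.foreach(println)   // expected when compiled: (P(bob),2) appears exactly once
    sc.stop()
  }
}
{code}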
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674253#comment-15674253 ] Sean Owen commented on SPARK-17436: --- Not sure, I have no failures on OS X, on Ubuntu, and all of the Jenkins tests pass. I don't see reports of problems building. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
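The order-preserving grouping proposed in the description can be sketched as below (a hypothetical helper, not the actual DynamicPartitionWriterContainer change): rows are bucketed per partition key in arrival order rather than pushed through a sorter, so any pre-existing sort within each key survives.
{code}
import scala.collection.mutable

object OrderPreservingGrouping {
  // Bucket rows by partition key in the order they arrive; each bucket keeps the
  // original row order, so writing buckets out preserves any earlier sort per key.
  def writeGrouped[K, R](rows: Iterator[(K, R)])(write: (K, Seq[R]) => Unit): Unit = {
    val buckets = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[R]]
    rows.foreach { case (key, row) =>
      buckets.getOrElseUpdate(key, mutable.ArrayBuffer.empty[R]) += row
    }
    buckets.foreach { case (key, buffered) => write(key, buffered) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Iterator(("p1", 1), ("p2", 2), ("p1", 3), ("p2", 4))
    writeGrouped(rows)((k, rs) => println(s"$k -> ${rs.mkString(",")}"))   // p1 -> 1,3 and p2 -> 2,4
  }
}
{code}
One likely trade-off is memory: the buffered rows for all open keys are held on the heap, whereas the sorter-based path can spill to disk.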
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674254#comment-15674254 ] Burak Yavuz commented on SPARK-18475: - [~ofirm] Thanks for your comment. I've seen significant performance improvements, and here's my explanation on how it happened, and why a simple repartition won't help: You're correct on the fact that for a given group.id, multiple consumers can't consume from the same TopicPartition concurrently. Where you get the benefit is that for skewed partitions, you don't wait for one Spark task to try and read everything from Kafka while all cores wait idle. You achieve better utilization because as tasks read less data, they can move on to the second step of the computation quicker, and while the first CPU has moved on to the second step (writing out to some storage system), your second CPU can start reading from Kafka. It kind of helps you pipeline your operations. If you use repartition, you're still going to have all your cores wait while that one consumer tries to read everything, and then you're going to cause a shuffle on top of it which is even worse. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18490: -- Assignee: Song Jun > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Assignee: Song Jun >Priority: Trivial > Fix For: 2.1.0 > > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
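The cleanup this ticket asks for can be sketched as below (illustrative only, not the merged patch): since both guarded cases in the quoted match build the same string, the isEstimated guard can simply be dropped. Here `coordinator` is a stand-in Option, not the real ExchangeCoordinator field.
{code}
object NodeNameSketch {
  def extraInfo(coordinator: Option[AnyRef]): String = coordinator match {
    case Some(_) => s"(coordinator id: ${System.identityHashCode(coordinator)})"
    case None    => ""
  }

  def main(args: Array[String]): Unit = {
    println(extraInfo(Some(new Object())))   // "(coordinator id: ...)"
    println(extraInfo(None))                 // ""
  }
}
{code}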
[jira] [Resolved] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18490. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15920 [https://github.com/apache/spark/pull/15920] > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Priority: Trivial > Fix For: 2.1.0 > > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674161#comment-15674161 ] zakaria hili commented on SPARK-18356: -- Hi [~yuhaoyan], I tried to improve the Kmeans using the same concept of caching in Logistic Regression. https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L310 and my result of performances: I used only one VM (Local Mode) with python -> Spark without improvement: the training takes ~0,605s (as a mean value) -> Spark with Kmeans improved: ~0,518s (the warning disappeared) so we can say that we did not gain a lot, but maybe we will see the difference if we run the train method many times. what do you think ? > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Priority: Minor > Labels: easyfix > > Hello, > I'm newbie in spark, but I think that I found a small problem that can affect > spark Kmeans performances. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark Kmeans with Dataframes to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning : > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched in spark source code to find the source of this problem, then I > realized there is two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my dataframe is cached, the fit method transform my dataframe into an > internally rdd which is not cached. > Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) responsible for converting the dataframe into > rdd then call Kmeans Algorithm > ->The second class (mllib package) implements Kmeans Algorithm, and here > spark verify if the rdd is cached, if not a warning will be generated. > So, the solution of this problem is to cache the rdd before running Kmeans > Algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All what we need is to add two lines: > Cache rdd just after dataframe transformation, then uncached it after > training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
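The caching pattern described in the comment can be written generically as below (a hypothetical helper, not the patched ml.KMeans.fit): persist the converted RDD only when the caller has not cached it, train, then unpersist so nothing leaks. This mirrors the handlePersistence idea used elsewhere in ML training code, which appears to be the "same concept of caching" the comment refers to.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CacheAroundTraining {
  // Persist `instances` for the duration of training if it is not already cached,
  // then uncache it afterwards ("cache the rdd before running the algorithm,
  // uncache it after training", as described above).
  def trainWithCache[T, M](instances: RDD[T])(train: RDD[T] => M): M = {
    val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
    try train(instances)
    finally if (handlePersistence) instances.unpersist(blocking = false)
  }
}
{code}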
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674147#comment-15674147 ] Amit Sela commented on SPARK-10816: --- [~rxin] it might be worth taking into account a more generic implementation of "Merging Windows". Sessions are merging windows that use "gap duration" to determine when to close the session, such that if we say that the first element arrives in window {{[event_time1, event_time1 + gap_duration)}} and the next one at {{[event_time2, event_time2 + gap_duration)}} and {{event_time1 < event_time2}}, their combined value will belong to {{[event_time1, event_time2 + gap_duration)}}, right ? But the same "merge" of windows could very well be determined by a "close session" element (using Kafka for example would guarantee order of messages), or any user defined logic for that matter, as long as the "merge function" is provided by the user. Of course providing Sessions API out-of-the-box would prove most useful as it is the most common, but I don't see any downside to also have a more "advanced" API here. Thanks! > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
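The gap-based merge rule described above can be written down concretely; the sketch below uses plain epoch milliseconds and is purely illustrative, not a proposed Structured Streaming API.
{code}
object SessionMergeSketch {
  // A session window is a half-open interval [start, end) in event time.
  final case class Session(start: Long, end: Long)

  def fromEvent(eventTimeMs: Long, gapMs: Long): Session = Session(eventTimeMs, eventTimeMs + gapMs)

  // Two sessions merge when their intervals overlap; the merged session spans both.
  def merge(a: Session, b: Session): Option[Session] =
    if (a.start < b.end && b.start < a.end)
      Some(Session(math.min(a.start, b.start), math.max(a.end, b.end)))
    else None

  def main(args: Array[String]): Unit = {
    val gap = 30000L
    val first  = fromEvent(1000L, gap)    // [1000, 31000)
    val second = fromEvent(20000L, gap)   // [20000, 50000)
    println(merge(first, second))         // Some(Session(1000,50000)) == [event_time1, event_time2 + gap)
  }
}
{code}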
[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674129#comment-15674129 ] Barry Becker commented on SPARK-12965: -- This is a big issue for us because we don't control the names of the columns that we get. One ugly workaround might be to convert . to _ in the columns, but then you need to worry about conflicting with other columns that differ only by their use of . or _. The backquoting works in many places, but there are still many places, like this, where we have seen that it does not work. > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 1.6.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., Given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---++ > |a.b|a_b|a_bIndex| > +---+---++ > |foo|foo| 0.0| > |bar|bar| 1.0| > +---+---++ > {noformat} > but I think it should have another column, {{abIndex}} with the same contents > as a_bIndex. > {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an empty RDD and DataFrame > List rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
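The rename workaround mentioned above could look like the sketch below (a hypothetical helper): replace dots before handing the column to StringIndexer, and fail fast if the rename would collide with an existing column, which is exactly the conflict the comment warns about.
{code}
import org.apache.spark.sql.DataFrame

object DotColumnWorkaround {
  // Rename every column containing '.' to use '_' instead, refusing to proceed if
  // the new name would clash with a column that already exists.
  def renameDottedColumns(df: DataFrame): DataFrame =
    df.columns.foldLeft(df) { (cur, name) =>
      if (!name.contains(".")) cur
      else {
        val renamed = name.replace('.', '_')
        require(!cur.columns.contains(renamed), s"Cannot rename '$name': '$renamed' already exists")
        cur.withColumnRenamed(name, renamed)
      }
    }
}
{code}
After the rename, setInputCol("a_b") resolves normally, at the cost of losing the original column names downstream.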
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674092#comment-15674092 ] Ofir Manor commented on SPARK-18475: Are you sure this is working? Having a visible perf effect? As far as I know, the maximum parallelism of Kafka is the number of topic-partitions, by design. If your consumer group has more consumers than that, some of them will be just idle. This is because when reading, each partition is owned by a single consumer (that allocation of partitions to consumer is dynamic, as consumers joins and leaves). To quote an older source: ??The first thing to understand is that a topic partition is the unit of parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So expensive operations such as compression can utilize more hardware resources. On the consumer side, Kafka always gives a single partition’s data to one consumer thread. Thus, the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed. Therefore, in general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve.?? https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ You could repartition the data in Spark after reading, to increase parallelism of Spark's processing. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
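The repartition approach suggested at the end of the comment would look roughly like the sketch below; the broker, topic, and partition count are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath. Kafka is still read with one task per TopicPartition, but the processing that follows is spread over more tasks.
{code}
import org.apache.spark.sql.SparkSession

object KafkaRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-repartition-sketch").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
      .option("subscribe", "events")                       // placeholder topic
      .load()
      .repartition(32)   // fan processing out beyond the number of TopicPartitions

    val query = stream.writeStream.format("console").start()
    query.awaitTermination()
  }
}
{code}
Note this only widens the work after the read; the read itself remains bounded by one consumer per TopicPartition, which is the limit discussed above.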
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim commented on SPARK-17436: -- I have basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test". This always fails for me. Can you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim edited comment on SPARK-17436 at 11/17/16 3:48 PM: I have basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 clean install". This always fails for me. Can you point me to someone who can help me? was (Author: ran.h...@optimalplus.com): I have basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test". This always fails for me. Can you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation
[ https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673931#comment-15673931 ] Emlyn Corrin commented on SPARK-18172: -- I'm not sure I've got the time to build from source at the moment to verify this, I think if it now works for you and [~hvanhovell] it's most likely fixed now. If it reoccurs for me with 2.0.3 or 2.1.0 once they're released I'll reopen this. Thanks. > AnalysisException in first/last during aggregation > -- > > Key: SPARK-18172 > URL: https://issues.apache.org/jira/browse/SPARK-18172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Emlyn Corrin > Fix For: 2.0.3, 2.1.0 > > > Since Spark 2.0.1, the following pyspark snippet fails with > {{AnalysisException: The second argument of First should be a boolean > literal}} (but it's not restricted to Python, similar code with in Java fails > in the same way). > It worked in Spark 2.0.0, so I believe it may be related to the fix for > SPARK-16648. > {code} > from pyspark.sql import functions as F > ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]])) > ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), > F.countDistinct(ds._2, ds._3)).show() > {code} > It works if any of the three arguments to {{.agg}} is removed. > The stack trace is: > {code} > Py4JJavaError Traceback (most recent call last) > in () > > 1 > ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2, > ds._3)).show() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py > in show(self, n, truncate) > 285 +---+-+ > 286 """ > --> 287 print(self._jdf.showString(n, truncate)) > 288 > 289 def __repr__(self): > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o76.showString. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: first(_2#1L)() > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggr
[jira] [Commented] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673918#comment-15673918 ] Song Jun commented on SPARK-18490: -- it is not a bug, just simplify this code > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Priority: Trivial > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
[ https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673904#comment-15673904 ] Herman van Hovell commented on SPARK-18004: --- which format should be passed to oracle? > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > > > Key: SPARK-18004 > URL: https://issues.apache.org/jira/browse/SPARK-18004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Suhas Nalapure >Priority: Critical > > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > with Exception java.sql.SQLDataException: ORA-01861: literal does not match > format string: > Java source code (this code works fine for mysql & mssql databases) : > {noformat} > //DataFrame df = create a DataFrame over an Oracle table > df = df.filter(df.col("TS").lt(new > java.sql.Timestamp(System.currentTimeMillis(; > df.explain(); > df.show(); > {noformat} > Log statements with the Exception: > {noformat} > Schema: root > |-- ID: string (nullable = false) > |-- TS: timestamp (nullable = true) > |-- DEVICE_ID: string (nullable = true) > |-- REPLACEMENT: string (nullable = true) > {noformat} > {noformat} > == Physical Plan == > Filter (TS#1 < 1476861841934000) > +- Scan > JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user, > password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, > driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] > PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)] > 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] > org.apache.spark.executor.Executor > Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: ORA-01861: literal does not match format string > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402) > at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065) > at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681) > at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256) > at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75) > at > oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043) > at > oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:) > at > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353) > at > oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485) > at > oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566) > at > oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
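Judging from the pushed filter above, the timestamp reaches Oracle as a bare quoted string literal, which only parses when NLS_TIMESTAMP_FORMAT happens to match. A hedged sketch of an Oracle-friendly rendering (illustrative only, not Spark's actual JDBC pushdown code; the column name and predicate shape are placeholders) wraps the literal in TO_TIMESTAMP with an explicit format mask:
{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat

object OracleTimestampLiteral {
  // Render a "column < timestamp" predicate so Oracle parses the literal regardless
  // of the session NLS settings.
  def lessThan(column: String, ts: Timestamp): String = {
    val formatted = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(ts)
    s"$column < TO_TIMESTAMP('$formatted', 'YYYY-MM-DD HH24:MI:SS.FF3')"
  }

  def main(args: Array[String]): Unit = {
    // Prints: TS < TO_TIMESTAMP('2016-10-19 12:54:01.934', 'YYYY-MM-DD HH24:MI:SS.FF3')
    println(lessThan("TS", Timestamp.valueOf("2016-10-19 12:54:01.934")))
  }
}
{code}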
[jira] [Updated] (SPARK-18444) SparkR running in yarn-cluster mode should not download Spark package
[ https://issues.apache.org/jira/browse/SPARK-18444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18444: Target Version/s: 2.1.0 > SparkR running in yarn-cluster mode should not download Spark package > - > > Key: SPARK-18444 > URL: https://issues.apache.org/jira/browse/SPARK-18444 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > > When running SparkR job in yarn-cluster mode, it will download Spark package > from apache website which is not necessary. > {code} > ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R > {code} > The following is output: > {code} > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from ‘package:base’: > as.data.frame, colnames, colnames<-, drop, endsWith, intersect, > rank, rbind, sample, startsWith, subset, summary, transform, union > Spark not found in SPARK_HOME: > Spark not found in the cache directory. Installation will start. > MirrorUrl not provided. > Looking for preferred site from apache website... > .. > {code} > There's no {{SPARK_HOME}} in yarn-cluster mode since the R process is in a > remote host of the yarn cluster rather than in the client host. The JVM comes > up first and the R process then connects to it. So in such cases we should > never have to download Spark as Spark is already running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18444) SparkR running in yarn-cluster mode should not download Spark package
[ https://issues.apache.org/jira/browse/SPARK-18444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18444: Priority: Critical (was: Major) > SparkR running in yarn-cluster mode should not download Spark package > - > > Key: SPARK-18444 > URL: https://issues.apache.org/jira/browse/SPARK-18444 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > > When running SparkR job in yarn-cluster mode, it will download Spark package > from apache website which is not necessary. > {code} > ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R > {code} > The following is output: > {code} > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from ‘package:base’: > as.data.frame, colnames, colnames<-, drop, endsWith, intersect, > rank, rbind, sample, startsWith, subset, summary, transform, union > Spark not found in SPARK_HOME: > Spark not found in the cache directory. Installation will start. > MirrorUrl not provided. > Looking for preferred site from apache website... > .. > {code} > There's no {{SPARK_HOME}} in yarn-cluster mode since the R process is in a > remote host of the yarn cluster rather than in the client host. The JVM comes > up first and the R process then connects to it. So in such cases we should > never have to download Spark as Spark is already running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18444) SparkR running in yarn-cluster mode should not download Spark package
[ https://issues.apache.org/jira/browse/SPARK-18444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-18444: --- Assignee: Yanbo Liang > SparkR running in yarn-cluster mode should not download Spark package > - > > Key: SPARK-18444 > URL: https://issues.apache.org/jira/browse/SPARK-18444 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > When running SparkR job in yarn-cluster mode, it will download Spark package > from apache website which is not necessary. > {code} > ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R > {code} > The following is output: > {code} > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from ‘package:base’: > as.data.frame, colnames, colnames<-, drop, endsWith, intersect, > rank, rbind, sample, startsWith, subset, summary, transform, union > Spark not found in SPARK_HOME: > Spark not found in the cache directory. Installation will start. > MirrorUrl not provided. > Looking for preferred site from apache website... > .. > {code} > There's no {{SPARK_HOME}} in yarn-cluster mode since the R process is in a > remote host of the yarn cluster rather than in the client host. The JVM comes > up first and the R process then connects to it. So in such cases we should > never have to download Spark as Spark is already running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
[ https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673885#comment-15673885 ] Suhas Nalapure commented on SPARK-18004: Date format as per the physical plan logged by Spark Dataframe: PushedFilters: [LessThan(TS,2016-11-17 19:42:01.057)] Confirmed the same from Oracle query logs as well: WHERE TS < '2016-11-17 19:42:01.057' > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > > > Key: SPARK-18004 > URL: https://issues.apache.org/jira/browse/SPARK-18004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Suhas Nalapure >Priority: Critical > > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > with Exception java.sql.SQLDataException: ORA-01861: literal does not match > format string: > Java source code (this code works fine for mysql & mssql databases) : > {noformat} > //DataFrame df = create a DataFrame over an Oracle table > df = df.filter(df.col("TS").lt(new > java.sql.Timestamp(System.currentTimeMillis(; > df.explain(); > df.show(); > {noformat} > Log statements with the Exception: > {noformat} > Schema: root > |-- ID: string (nullable = false) > |-- TS: timestamp (nullable = true) > |-- DEVICE_ID: string (nullable = true) > |-- REPLACEMENT: string (nullable = true) > {noformat} > {noformat} > == Physical Plan == > Filter (TS#1 < 1476861841934000) > +- Scan > JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user, > password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, > driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] > PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)] > 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] > org.apache.spark.executor.Executor > Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: ORA-01861: literal does not match format string > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402) > at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065) > at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681) > at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256) > at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75) > at > oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043) > at > oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:) > at > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353) > at > oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485) > at > oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566) > at > oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spar