[jira] [Resolved] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19518. --- Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.2.0 > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi >Assignee: Hyukjin Kwon > Fix For: 2.2.0 > > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, > however it does not work in SQL statements as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20243. --- Resolution: Fixed Assignee: Bogdan Raducanu Fix Version/s: 2.2.0 > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu > Fix For: 2.2.0 > > > Introduced by SPARK-19946. > DebugFilesystem.assertNoOpenStreams gets the size of the openStreams > ConcurrentHashMap and then later, if the size was > 0, accesses the first > element in openStreams.values. But, the ConcurrentHashMap might be cleared by > another thread between getting its size and accessing it, resulting in an > exception when trying to call .head on an empty collection. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963136#comment-15963136 ] Marcelo Vanzin commented on SPARK-16742: bq. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. That's the gist of it, yes. But note that it isn't restricted to files. If all the user processes are running as the same user, one can just dump the other's heap, or connect using JVMTI, and get the credentials. Same problem. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. bq. I'm assuming that hadoop.security.auth_to_local is what maps the Kerberos user to the Unix user... I'm not exactly familiar with all the YARN settings but yes, the result you get is that the submitting user runs YARN containers as their own user (nor as some generic, shared user). Without that, you shouldn't even bother thinking about inserting Kerberos in the picture, IMO. bq. We avoid the shared-file problem for keytabs entirely See my first comment above, that's not enough. bq. We're probably going to punt on cluster mode for now You don't need to punt on cluster mode. I don't know where this notion that cluster mode requires you to distribute keytabs comes from; Spark works just fine in YARN cluster mode without distributing keytabs. All you need to distribute are delegation tokens. Keytabs aren't even necessary to log in and submit the app at all (you can use passwords with kinit, after all). The only thing distributing keytabs buys you is running applications for longer than the delegation tokens' max lifetime (normally 7 days by default). bq. If you see any blockers Lack of user isolation is always a blocker; without that there's no way to prevent one user from seeing another's credentials. But I've asked this in the past and the answer I got is that Mesos supports it... > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20273. - Resolution: Fixed Fix Version/s: 2.2.0 > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters could be pushed down to the join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, we block users to add > non-deterministics conditions by the analyzer (For details, see the PR > https://github.com/apache/spark/pull/7535). > We should not push down non-deterministic conditions; otherwise, we should > allow users to do it by explicitly initialize the non-deterministic > expressions -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets
[ https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963367#comment-15963367 ] Charles Pritchard commented on SPARK-19352: --- Does this fix the issue in SPARK-18934 ? > Sorting issues on relatively big datasets > - > > Key: SPARK-19352 > URL: https://issues.apache.org/jira/browse/SPARK-19352 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: Spark version 2.1.0 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102 > macOS 10.12.3 >Reporter: Ivan Gozali > > _More details, including the script to generate the synthetic dataset > (requires pandas and numpy) are in this GitHub gist._ > https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f > Given a relatively large synthetic time series dataset of various users > (4.1GB), when attempting to: > * partition this dataset by user ID > * sort the time series data for each user by timestamp > * write each partition to a single CSV file > then some files are unsorted in a very specific manner. In one of the > supposedly sorted files, the rows looked as follows: > {code} > 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39 > 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22 > 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47 > 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14 > 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24 > 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43 > {code} > The above is attempted using the following Scala/Spark code: > {code} > val inpth = "/tmp/gen_data_3cols_small" > spark > .read > .option("inferSchema", "true") > .option("header", "true") > .csv(inpth) > .repartition($"userId") > .sortWithinPartitions("timestamp") > .write > .partitionBy("userId") > .option("header", "true") > .csv(inpth + "_sorted") > {code} > This issue is not seen when using a smaller sized dataset by making the time > span smaller (354MB, with the same number of columns). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20156) Java String toLowerCase "Turkish locale bug" causes Spark problems
[ https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963406#comment-15963406 ] Serkan Taş commented on SPARK-20156: Thank you Sean, it was more than i expected i think. > Java String toLowerCase "Turkish locale bug" causes Spark problems > -- > > Key: SPARK-20156 > URL: https://issues.apache.org/jira/browse/SPARK-20156 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0 > Environment: Ubunutu 16.04 > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121) >Reporter: Serkan Taş >Assignee: Sean Owen > Fix For: 2.2.0 > > Attachments: sprk_shell.txt > > > If the regional setting of the operation system is Turkish, the famous java > locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or > https://issues.apache.org/jira/browse/AVRO-1493). > e.g : > "SERDEINFO" lowers to "serdeınfo" > "uniquetable" uppers to "UNİQUETABLE" > work around : > add -Duser.country=US -Duser.language=en to the end of the line > SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true" > in spark-shell.sh -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963402#comment-15963402 ] Apache Spark commented on SPARK-12837: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17596 > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12837: Assignee: Apache Spark (was: Wenchen Fan) > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Apache Spark >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12837: Assignee: Wenchen Fan (was: Apache Spark) > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20285: Assignee: Apache Spark (was: Shixiong Zhu) > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963412#comment-15963412 ] Apache Spark commented on SPARK-20285: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17597 > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20285: Assignee: Shixiong Zhu (was: Apache Spark) > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963419#comment-15963419 ] Apache Spark commented on SPARK-20284: -- User 'superbobry' has created a pull request for this issue: https://github.com/apache/spark/pull/17598 > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Minor > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20284: Assignee: (was: Apache Spark) > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Minor > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20284: Assignee: Apache Spark > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Assignee: Apache Spark >Priority: Minor > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20285. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 2.0.3 > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20285: - Comment: was deleted (was: https://github.com/apache/spark/pull/17597) > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963422#comment-15963422 ] Shixiong Zhu commented on SPARK-20285: -- https://github.com/apache/spark/pull/17597 > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963353#comment-15963353 ] Michael Armbrust commented on SPARK-19067: -- No, this will be available in Spark 2.2.0 > mapGroupsWithState - arbitrary stateful operations with Structured Streaming > (similar to DStream.mapWithState) > -- > > Key: SPARK-19067 > URL: https://issues.apache.org/jira/browse/SPARK-19067 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das >Priority: Critical > Fix For: 2.2.0 > > > Right now the only way to do stateful operations with with Aggregator or > UDAF. However, this does not give users control of emission or expiration of > state making it hard to implement things like sessionization. We should add > a more general construct (probably similar to {{DStream.mapWithState}}) to > structured streaming. Here is the design. > *Requirements* > - Users should be able to specify a function that can do the following > - Access the input row corresponding to a key > - Access the previous state corresponding to a key > - Optionally, update or remove the state > - Output any number of new rows (or none at all) > *Proposed API* > {code} > // New methods on KeyValueGroupedDataset > class KeyValueGroupedDataset[K, V] { > // Scala friendly > def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], > State[S]) => U) > def flatMapGroupsWithState[S: Encode, U: Encoder](func: (K, > Iterator[V], State[S]) => Iterator[U]) > // Java friendly >def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, > R], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) >def flatMapGroupsWithState[S, U](func: > FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], > resultEncoder: Encoder[U]) > } > // --- New Java-friendly function classes --- > public interface MapGroupsWithStateFunctionextends Serializable { > R call(K key, Iterator values, state: State) throws Exception; > } > public interface FlatMapGroupsWithStateFunction extends > Serializable { > Iterator call(K key, Iterator values, state: GroupState) throws > Exception; > } > // -- Wrapper class for state data -- > trait GroupState[S] { > def exists(): Boolean > def get(): S// throws Exception is state does not > exist > def getOption(): Option[S] > def update(newState: S): Unit > def remove(): Unit // exists() will be false after this > } > {code} > Key Semantics of the State class > - The state can be null. > - If the state.remove() is called, then state.exists() will return false, and > getOption will returm None. > - After that state.update(newState) is called, then state.exists() will > return true, and getOption will return Some(...). > - None of the operations are thread-safe. This is to avoid memory barriers. > *Usage* > {code} > val stateFunc = (word: String, words: Iterator[String, runningCount: > GroupState[Long]) => { > val newCount = words.size + runningCount.getOption.getOrElse(0L) > runningCount.update(newCount) >(word, newCount) > } > dataset // type > is Dataset[String] > .groupByKey[String](w => w) // generates > KeyValueGroupedDataset[String, String] > .mapGroupsWithState[Long, (String, Long)](stateFunc)// returns > Dataset[(String, Long)] > {code} > *Future Directions* > - Timeout based state expiration (that has not received data for a while) - > Done > - General expression based expiration - TODO. Any real usecases that cannot > be done with timeouts? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
[ https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20282: Assignee: (was: Apache Spark) > Flaky test: > org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception > - > > Key: SPARK-20282 > URL: https://issues.apache.org/jira/browse/SPARK-20282 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Priority: Minor > Labels: flaky-test > > I saw the following failure several times: > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Assert on query failed: > == Progress == >AssertOnQuery(, ) >StopStream >AddData to MemoryStream[value#30891]: 1,2 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) >CheckAnswer: [6],[3] >StopStream > => AssertOnQuery(, ) >AssertOnQuery(, ) >StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) >CheckAnswer: [6],[3] >StopStream >AddData to MemoryStream[value#30891]: 3 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) >CheckLastBatch: [2] >StopStream >AddData to MemoryStream[value#30891]: 0 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) >ExpectFailure[org.apache.spark.SparkException, isFatalError: false] >AssertOnQuery(, ) >AssertOnQuery(, incorrect start offset or end offset on > exception) > == Stream == > Output Mode: Append > Stream state: not started > Thread state: dead > == Sink == > 0: [6] [3] > == Plan == > > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.fail(Assertions.scala:1328) > at org.scalatest.FunSuite.fail(FunSuite.scala:1555) > at > org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) > at > org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) > at > org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41) >
[jira] [Commented] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
[ https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963341#comment-15963341 ] Apache Spark commented on SPARK-20282: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17594 > Flaky test: > org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception > - > > Key: SPARK-20282 > URL: https://issues.apache.org/jira/browse/SPARK-20282 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Priority: Minor > Labels: flaky-test > > I saw the following failure several times: > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Assert on query failed: > == Progress == >AssertOnQuery(, ) >StopStream >AddData to MemoryStream[value#30891]: 1,2 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) >CheckAnswer: [6],[3] >StopStream > => AssertOnQuery(, ) >AssertOnQuery(, ) >StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) >CheckAnswer: [6],[3] >StopStream >AddData to MemoryStream[value#30891]: 3 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) >CheckLastBatch: [2] >StopStream >AddData to MemoryStream[value#30891]: 0 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) >ExpectFailure[org.apache.spark.SparkException, isFatalError: false] >AssertOnQuery(, ) >AssertOnQuery(, incorrect start offset or end offset on > exception) > == Stream == > Output Mode: Append > Stream state: not started > Thread state: dead > == Sink == > 0: [6] [3] > == Plan == > > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.fail(Assertions.scala:1328) > at org.scalatest.FunSuite.fail(FunSuite.scala:1555) > at > org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) > at > org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) > at > org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at >
[jira] [Assigned] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
[ https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20282: Assignee: Apache Spark > Flaky test: > org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception > - > > Key: SPARK-20282 > URL: https://issues.apache.org/jira/browse/SPARK-20282 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > Labels: flaky-test > > I saw the following failure several times: > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Assert on query failed: > == Progress == >AssertOnQuery(, ) >StopStream >AddData to MemoryStream[value#30891]: 1,2 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) >CheckAnswer: [6],[3] >StopStream > => AssertOnQuery(, ) >AssertOnQuery(, ) >StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) >CheckAnswer: [6],[3] >StopStream >AddData to MemoryStream[value#30891]: 3 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) >CheckLastBatch: [2] >StopStream >AddData to MemoryStream[value#30891]: 0 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) >ExpectFailure[org.apache.spark.SparkException, isFatalError: false] >AssertOnQuery(, ) >AssertOnQuery(, incorrect start offset or end offset on > exception) > == Stream == > Output Mode: Append > Stream state: not started > Thread state: dead > == Sink == > 0: [6] [3] > == Plan == > > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.fail(Assertions.scala:1328) > at org.scalatest.FunSuite.fail(FunSuite.scala:1555) > at > org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) > at > org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) > at > org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at >
[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446 ] Michael Gummelt edited comment on SPARK-16742 at 4/10/17 8:18 PM: -- [~vanzin] bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user, and the *Mesos* principal of the scheduler, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. was (Author: mgummelt): bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user, and the *Mesos* principal of the scheduler, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20285: - Affects Version/s: (was: 2.1.1) (was: 2.0.3) (was: 2.2.0) 2.1.0 > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20285: - Affects Version/s: 2.1.1 2.0.3 > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304 ] Ryan Williams edited comment on SPARK-650 at 4/10/17 6:42 PM: -- Both suggested workarounds here are lacking or broken / actively harmful, afaict, and the use case is real and valid. The ADAM project struggled for >2 years with this problem: - [a 3rd-party {{OutputFormat}} required this field to be set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93] - the value of the field is computed on the driver, and needs to somehow be sent to and set in each executor JVM. h3. {{mapPartitions}} hack [Some attempts to set the field via a dummy {{mapPartitions}} job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146] actually added [pernicious, non-deterministic bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677]. In general Spark seems to provide no guarantees that ≥1 tasks will get scheduled on each executor in such a situation: - in the above, node locality resulted in some executors being missed - dynamic-allocation also offers chances for executors to come online later and never be initialized h3. object/singleton initialization How can one use singleton initialization to pass an object from the driver to each executor? Maybe I've missed this in the discussion above. In the end, ADAM decided to write the object to a file and route that file's path to the {{OutputFormat}} via a hadoop configuration value, which is pretty inelegant. h4. Another use case I have another need for this atm where regular lazy-object-initialization is also insufficient: [due to a rough-edge in Scala programs' classloader configuration, {{FileSystemProvider}}'s in user JARs are not loaded properly|https://github.com/scala/bug/issues/10247]. [A workaround discussed in the 1st post on that issue fixes the problem|https://github.com/hammerlab/spark-commands/blob/1.0.3/src/main/scala/org/hammerlab/commands/FileSystems.scala#L8-L20], but needs to be run before {{FileSystemProvider.installedProviders}} is first called on the JVM, which can be triggered by numerous {{java.nio.file}} operations. I don't see a clear way to work in code in that will always lazily call my {{FileSystems.load}} function on each executor, let alone ensure that it happens before any code in the JAR calls e.g. {{Paths.get}}. was (Author: rdub): Both suggested workarounds here are lacking or broken / actively harmful, afaict, and the use case is real and valid. The ADAM project struggled for >2 years with this problem: - [a 3rd-party {{OutputFormat}} required this field to be set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93] - the value of the field is computed on the driver, and needs to somehow be sent to and set in each executor JVM. h3. {{mapPartitions}} hack [Some attempts to set the field via a dummy {{mapPartitions}} job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146] actually added [pernicious, non-deterministic bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677]. In general Spark seems to provide no guarantees that ≥1 tasks will get scheduled on each executor in such a situation: - in the above, node locality resulted in some executors being missed - dynamic-allocation also offers chances for executors to come online later and never be initialized h3. object/singleton initialization How can one use singleton initialization to pass an object from the driver to each executor? Maybe I've missed this in the discussion above. In the end, ADAM decided to write the object to a file and route that file's path to the {{OutputFormat}} via a hadoop configuration value, which is pretty inelegant. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Created] (SPARK-20283) Add preOptimizationBatches
Reynold Xin created SPARK-20283: --- Summary: Add preOptimizationBatches Key: SPARK-20283 URL: https://issues.apache.org/jira/browse/SPARK-20283 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin We currently have postHocOptimizationBatches, but not preOptimizationBatches. Let's add it so it is symmetric. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20156) Java String toLowerCase "Turkish locale bug" causes Spark problems
[ https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20156. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17527 [https://github.com/apache/spark/pull/17527] > Java String toLowerCase "Turkish locale bug" causes Spark problems > -- > > Key: SPARK-20156 > URL: https://issues.apache.org/jira/browse/SPARK-20156 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0 > Environment: Ubunutu 16.04 > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121) >Reporter: Serkan Taş >Assignee: Sean Owen > Fix For: 2.2.0 > > Attachments: sprk_shell.txt > > > If the regional setting of the operation system is Turkish, the famous java > locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or > https://issues.apache.org/jira/browse/AVRO-1493). > e.g : > "SERDEINFO" lowers to "serdeınfo" > "uniquetable" uppers to "UNİQUETABLE" > work around : > add -Duser.country=US -Duser.language=en to the end of the line > SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true" > in spark-shell.sh -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets
[ https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963365#comment-15963365 ] Charles Pritchard commented on SPARK-19352: --- [~cloud_fan] Yes, Hive relies on sorting optimizations for running map side joins. DISTRIBUTE BY and SORT BY can be used to manually output data into single sorted files per partition. Hive will ensure sorting when running INSERT OVERWRITE statements, when a table is created with PARTITIONED BY... CLUSTERED BY... SORTED BY ... INTO 1 BUCKETS. Spark also reads the Hive metastore to detect when files are already sorted, and runs optimizations. > Sorting issues on relatively big datasets > - > > Key: SPARK-19352 > URL: https://issues.apache.org/jira/browse/SPARK-19352 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: Spark version 2.1.0 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102 > macOS 10.12.3 >Reporter: Ivan Gozali > > _More details, including the script to generate the synthetic dataset > (requires pandas and numpy) are in this GitHub gist._ > https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f > Given a relatively large synthetic time series dataset of various users > (4.1GB), when attempting to: > * partition this dataset by user ID > * sort the time series data for each user by timestamp > * write each partition to a single CSV file > then some files are unsorted in a very specific manner. In one of the > supposedly sorted files, the rows looked as follows: > {code} > 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39 > 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22 > 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47 > 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14 > 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24 > 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43 > {code} > The above is attempted using the following Scala/Spark code: > {code} > val inpth = "/tmp/gen_data_3cols_small" > spark > .read > .option("inferSchema", "true") > .option("header", "true") > .csv(inpth) > .repartition($"userId") > .sortWithinPartitions("timestamp") > .write > .partitionBy("userId") > .option("header", "true") > .csv(inpth + "_sorted") > {code} > This issue is not seen when using a smaller sized dataset by making the time > span smaller (354MB, with the same number of columns). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
Sergei Lebedev created SPARK-20284: -- Summary: Make SerializationStream and DeserializationStream extend Closeable Key: SPARK-20284 URL: https://issues.apache.org/jira/browse/SPARK-20284 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0, 1.6.3 Reporter: Sergei Lebedev Priority: Minor Both {{SerializationStream}} and {{DeserializationStream}} implement {{close}} but do not extend {{Closeable}}. As a result, these streams cannot be used in try-with-resources. Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20280) SharedInMemoryCache Weigher integer overflow
[ https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20280. --- Resolution: Fixed Assignee: Bogdan Raducanu Fix Version/s: 2.2.0 2.1.1 > SharedInMemoryCache Weigher integer overflow > > > Key: SPARK-20280 > URL: https://issues.apache.org/jira/browse/SPARK-20280 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu > Fix For: 2.1.1, 2.2.0 > > > in FileStatusCache.scala: > {code} > .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { > override def weigh(key: (ClientId, Path), value: Array[FileStatus]): > Int = { > (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt > }}) > {code} > Weigher.weigh returns Int but the size of an Array[FileStatus] could be > bigger than Int.maxValue. Then, a negative value is returned, leading to this > exception: > {code} > * [info] java.lang.IllegalStateException: Weights must be non-negative > * [info] at > com.google.common.base.Preconditions.checkState(Preconditions.java:149) > * [info] at > com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223) > * [info] at > com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944) > * [info] at com.google.common.cache.LocalCache.put(LocalCache.java:4212) > * [info] at > com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > * [info] at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446 ] Michael Gummelt commented on SPARK-16742: - bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user, and the *Mesos* principal of the scheduler, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446 ] Michael Gummelt edited comment on SPARK-16742 at 4/10/17 8:20 PM: -- [~vanzin] bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user. The scheduler's *Mesos* principal, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. was (Author: mgummelt): [~vanzin] bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user, and the *Mesos* principal of the scheduler, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963469#comment-15963469 ] Michael Gummelt commented on SPARK-16742: - [~jerryshao] Great! The current RPC used in Mesos is very simple. The executor just periodically requests the latest credentials from the driver, which uses the keytab to periodically renew. We can swap in a different mechanism once that exists. I left a comment on your design doc. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304 ] Ryan Williams commented on SPARK-650: - Both suggested workarounds here are lacking or broken / actively harmful, afaict, and the use case is real and valid. The ADAM project struggled for >2 years with this problem: - [a 3rd-party {{OutputFormat}} required this field to be set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93] - the value of the field is computed on the driver, and needs to somehow be sent to and set in each executor JVM. h3. {{mapPartitions}} hack [Some attempts to set the field via a dummy {{mapPartitions}} job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146] actually added [pernicious, non-deterministic bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677]. In general Spark seems to provide no guarantees that ≥1 tasks will get scheduled on each executor in such a situation: - in the above, node locality resulted in some executors being missed - dynamic-allocation also offers chances for executors to come online later and never be initialized h3. object/singleton initialization How can one use singleton initialization to pass an object from the driver to each executor? Maybe I've missed this in the discussion above. In the end, ADAM decided to write the object to a file and route that file's path to the {{OutputFormat}} via a hadoop configuration value, which is pretty inelegant. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
[ https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20282: - Issue Type: Test (was: Bug) > Flaky test: > org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception > - > > Key: SPARK-20282 > URL: https://issues.apache.org/jira/browse/SPARK-20282 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Priority: Minor > Labels: flaky-test > > I saw the following failure several times: > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Assert on query failed: > == Progress == >AssertOnQuery(, ) >StopStream >AddData to MemoryStream[value#30891]: 1,2 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) >CheckAnswer: [6],[3] >StopStream > => AssertOnQuery(, ) >AssertOnQuery(, ) >StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) >CheckAnswer: [6],[3] >StopStream >AddData to MemoryStream[value#30891]: 3 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) >CheckLastBatch: [2] >StopStream >AddData to MemoryStream[value#30891]: 0 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) >ExpectFailure[org.apache.spark.SparkException, isFatalError: false] >AssertOnQuery(, ) >AssertOnQuery(, incorrect start offset or end offset on > exception) > == Stream == > Output Mode: Append > Stream state: not started > Thread state: dead > == Sink == > 0: [6] [3] > == Plan == > > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.fail(Assertions.scala:1328) > at org.scalatest.FunSuite.fail(FunSuite.scala:1555) > at > org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) > at > org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) > at > org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41) > at >
[jira] [Created] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
Shixiong Zhu created SPARK-20282: Summary: Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception Key: SPARK-20282 URL: https://issues.apache.org/jira/browse/SPARK-20282 Project: Spark Issue Type: Bug Components: Structured Streaming, Tests Affects Versions: 2.2.0 Reporter: Shixiong Zhu Priority: Minor I saw the following failure several times: {code} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Assert on query failed: == Progress == AssertOnQuery(, ) StopStream AddData to MemoryStream[value#30891]: 1,2 StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) CheckAnswer: [6],[3] StopStream => AssertOnQuery(, ) AssertOnQuery(, ) StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) CheckAnswer: [6],[3] StopStream AddData to MemoryStream[value#30891]: 3 StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) CheckLastBatch: [2] StopStream AddData to MemoryStream[value#30891]: 0 StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) ExpectFailure[org.apache.spark.SparkException, isFatalError: false] AssertOnQuery(, ) AssertOnQuery(, incorrect start offset or end offset on exception) == Stream == Output Mode: Append Stream state: not started Thread state: dead == Sink == 0: [6] [3] == Plan == at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) at org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) at org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) at org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQuerySuite.scala:41) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.sql.streaming.StreamingQuerySuite.runTest(StreamingQuerySuite.scala:41) at
[jira] [Assigned] (SPARK-20283) Add preOptimizationBatches
[ https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20283: Assignee: Reynold Xin (was: Apache Spark) > Add preOptimizationBatches > -- > > Key: SPARK-20283 > URL: https://issues.apache.org/jira/browse/SPARK-20283 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently have postHocOptimizationBatches, but not preOptimizationBatches. > This patch adds preOptimizationBatches so the optimizer debugging extensions > are symmetric. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20283) Add preOptimizationBatches
[ https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963351#comment-15963351 ] Apache Spark commented on SPARK-20283: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17595 > Add preOptimizationBatches > -- > > Key: SPARK-20283 > URL: https://issues.apache.org/jira/browse/SPARK-20283 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently have postHocOptimizationBatches, but not preOptimizationBatches. > Let's add it so it is symmetric. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20283) Add preOptimizationBatches
[ https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20283: Assignee: Apache Spark (was: Reynold Xin) > Add preOptimizationBatches > -- > > Key: SPARK-20283 > URL: https://issues.apache.org/jira/browse/SPARK-20283 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > We currently have postHocOptimizationBatches, but not preOptimizationBatches. > Let's add it so it is symmetric. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20283) Add preOptimizationBatches
[ https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-20283: Description: We currently have postHocOptimizationBatches, but not preOptimizationBatches. This patch adds preOptimizationBatches so the optimizer debugging extensions are symmetric. was: We currently have postHocOptimizationBatches, but not preOptimizationBatches. Let's add it so it is symmetric. > Add preOptimizationBatches > -- > > Key: SPARK-20283 > URL: https://issues.apache.org/jira/browse/SPARK-20283 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > We currently have postHocOptimizationBatches, but not preOptimizationBatches. > This patch adds preOptimizationBatches so the optimizer debugging extensions > are symmetric. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20285) Flaky test:
Shixiong Zhu created SPARK-20285: Summary: Flaky test: Key: SPARK-20285 URL: https://issues.apache.org/jira/browse/SPARK-20285 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Saw the following failure locally: {code} Traceback (most recent call last): File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup self._test_func(input, func, expected, sort=True, input2=input2) File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] First list contains 3 additional elements. First extra element 0: [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] + [] - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] {code} It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
[ https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20285: - Summary: Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup (was: Flaky test: ) > Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup > > > Key: SPARK-20285 > URL: https://issues.apache.org/jira/browse/SPARK-20285 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Saw the following failure locally: > {code} > Traceback (most recent call last): > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, > in test_cogroup > self._test_func(input, func, expected, sort=True, input2=input2) > File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, > in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] > First list contains 3 additional elements. > First extra element 0: > [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] > + [] > - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], > - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], > - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] > {code} > It also happened on Jenkins: > http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446 ] Michael Gummelt edited comment on SPARK-16742 at 4/10/17 8:35 PM: -- [~vanzin] bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user. The scheduler's *Mesos* principal, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. You're right that we could implement cluster mode in some form, but I'd rather keep the initial PR small. I hope that's acceptable. was (Author: mgummelt): [~vanzin] bq. The most basic feature needed for any kerberos-related work is user isolation (different users cannot mess with each others' processes). I was under the impression that Mesos supported that. Mesos of course supports configuring the Linux user that process runs as. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user. The scheduler's *Mesos* principal, along with ACLs configured in Mesos, is what determines which Linux users are allowed. That's why I was asking about {{hadoop.security.auth_to_local}}, to understand how YARN determines what Linux user to run executors as. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. bq. I don't know where this notion that cluster mode requires you to distribute keytabs comes from As you said, it's mostly the renewal use case that requires distributing the keytab, but that's not all. In many Mesos setups, and certainly in DC/OS, the submitting user might not already be kinit'd. They may be running from outside the datacenter entirely, without network access to the KDC. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
[ https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20282. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.2.0 > Flaky test: > org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception > - > > Key: SPARK-20282 > URL: https://issues.apache.org/jira/browse/SPARK-20282 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Labels: flaky-test > Fix For: 2.2.0 > > > I saw the following failure several times: > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Assert on query failed: > == Progress == >AssertOnQuery(, ) >StopStream >AddData to MemoryStream[value#30891]: 1,2 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) >CheckAnswer: [6],[3] >StopStream > => AssertOnQuery(, ) >AssertOnQuery(, ) >StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) >CheckAnswer: [6],[3] >StopStream >AddData to MemoryStream[value#30891]: 3 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) >CheckLastBatch: [2] >StopStream >AddData to MemoryStream[value#30891]: 0 > > StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) >ExpectFailure[org.apache.spark.SparkException, isFatalError: false] >AssertOnQuery(, ) >AssertOnQuery(, incorrect start offset or end offset on > exception) > == Stream == > Output Mode: Append > Stream state: not started > Thread state: dead > == Sink == > 0: [6] [3] > == Plan == > > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.fail(Assertions.scala:1328) > at org.scalatest.FunSuite.fail(FunSuite.scala:1555) > at > org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) > at > org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) > at > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) > at > org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) > at > org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at > org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at >
[jira] [Commented] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
[ https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963663#comment-15963663 ] Apache Spark commented on SPARK-17564: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17599 > Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay > -- > > Key: SPARK-17564 > URL: https://issues.apache.org/jira/browse/SPARK-17564 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.1, 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > Could be related to [SPARK-10680] > This is the test and one fix would be to increase the timeouts from 1.2 > seconds to 5 seconds > {code} > // The timeout is relative to the LAST request sent, which is kinda weird, > but still. > // This test also makes sure the timeout works for Fetch requests as well > as RPCs. > @Test > public void furtherRequestsDelay() throws Exception { > final byte[] response = new byte[16]; > final StreamManager manager = new StreamManager() { > @Override > public ManagedBuffer getChunk(long streamId, int chunkIndex) { > Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS); > return new NioManagedBuffer(ByteBuffer.wrap(response)); > } > }; > RpcHandler handler = new RpcHandler() { > @Override > public void receive( > TransportClient client, > ByteBuffer message, > RpcResponseCallback callback) { > throw new UnsupportedOperationException(); > } > @Override > public StreamManager getStreamManager() { > return manager; > } > }; > TransportContext context = new TransportContext(conf, handler); > server = context.createServer(); > clientFactory = context.createClientFactory(); > TransportClient client = > clientFactory.createClient(TestUtils.getLocalHost(), server.getPort()); > // Send one request, which will eventually fail. > TestCallback callback0 = new TestCallback(); > client.fetchChunk(0, 0, callback0); > Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS); > // Send a second request before the first has failed. > TestCallback callback1 = new TestCallback(); > client.fetchChunk(0, 1, callback1); > Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS); > // not complete yet, but should complete soon > assertEquals(-1, callback0.successLength); > assertNull(callback0.failure); > callback0.latch.await(60, TimeUnit.SECONDS); > assertTrue(callback0.failure instanceof IOException); > // failed at same time as previous > assertTrue(callback1.failure instanceof IOException); // This is where we > fail because callback1.failure is null > } > {code} > If there are better suggestions for improving this test let's take them > onboard, I think using 5 sec timeout periods would be a place to start so > folks don't need to needlessly triage this failure. Will add a few prints and > report back -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20270: Fix Version/s: 2.1.1 2.0.3 > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. > With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963559#comment-15963559 ] Marcelo Vanzin commented on SPARK-16742: bq. But in Spark, this isn't currently derived from the Kerberos principal. It's configured by the user. That sounds problematic. The way YARN works is that it actually authenticates the user. Are you saying that Mesos doesn't do user authentication? The overarching point I'm trying to make with my comments is that for kerberos support to be properly secure, the cluster manager needs to be secure. That means running applications from different users in a way that doesn't allow them to hack each other. YARN does that by doing authentication when users request applications to run, and by running the containers as the requested user. The exact way in which YARN achieves that seems kinda tangential to the actual question, which is: What is the story for Mesos? Basically, the way in which you support Kerberos will depend on how your cluster manager does security. If Mesos behaves more like Spark Standalone than it does like YARN, then any solution that requires distributing user credentials is a non-starter, because it just becomes a security liability. bq. It would be a vulnerability, for example, if the Linux user for the executors is simply derived from that of the driver, because two human users running as the same Linux user, but logged in via different Kerberos principals, would be able to see each others' tokens. Are you saying that for YARN or Mesos? When YARN runs in Kerberos mode, Kerberos dictates the user. That's how the user is authenticating to YARN. There's a requirement that an OS user exists matching that particular user, but that's just a configuration detail. The security comes from the fact that the user is authenticating to the KDC. bq. You're right that we could implement cluster mode in some form, but I'd rather keep the initial PR small. I hope that's acceptable. The main point I'm trying to convey here is that running things in client and cluster mode should be exactly the same from the point of view of distributing tokens. The use case you mention ("user starting an application in cluster mode with no kerberos credentials") sounds actually worrying, because what's authenticating the user? > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
Stephane Maarek created SPARK-20287: --- Summary: Kafka Consumer should be able to subscribe to more than one topic partition Key: SPARK-20287 URL: https://issues.apache.org/jira/browse/SPARK-20287 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Stephane Maarek As I understand and as it stands, one Kafka Consumer is created for each topic partition in the source Kafka topics, and they're cached. cf https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48 In my opinion, that makes the design an anti pattern for Kafka and highly unefficient: - Each Kafka consumer creates a connection to Kafka - Spark doesn't leverage the power of the Kafka consumers, which is that it automatically assigns and balances partitions amongst all the consumers that share the same group.id - You can still cache your Kafka consumer even if it has multiple partitions. I'm not sure about how that translates to the spark underlying RDD architecture, but from a Kafka standpoint, I believe creating one consumer per partition is a big overhead, and a risk as the user may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity parameter. Happy to discuss to understand the rationale -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20284: -- Priority: Trivial (was: Minor) It might be fine to add, but, does it help anything? try-with-resources is only for Java. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963601#comment-15963601 ] Michael Gummelt commented on SPARK-16742: - bq. So, assuming that Mesos is configured properly, then it should be OK for Spark code to distribute user credentials. Right. It's just a matter of the cluster admin syncing Mesos credentials and kerberos credentials properly. In summary, it's simpler in YARN because YARN is Kerberos-aware, whereas Mesos isn't. bq. That sounds like you might need the current code that distributes keytabs and logs in the cluster to make even client mode work in this setup. Since client mode requires network access to the Mesos master, we generally assume that the user is on the same network as their datacenter, and can thus kinit against the KDC. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18555) na.fill miss up original values in long integers
[ https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963684#comment-15963684 ] Apache Spark commented on SPARK-18555: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/17600 > na.fill miss up original values in long integers > > > Key: SPARK-18555 > URL: https://issues.apache.org/jira/browse/SPARK-18555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Mahmoud Rawas >Assignee: Song Jun >Priority: Critical > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Manly the issue is clarified in the following example: > Given a Dataset: > scala> data.show > | a| b| > | 1| 2| > | -1| -2| > |9123146099426677101|9123146560113991650| > theoretically when we call na.fill(0) nothing should change, while the > current result is: > scala> data.na.fill(0).show > | a| b| > | 1| 2| > | -1| -2| > |9123146099426676736|9123146560113991680| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18555) na.fill miss up original values in long integers
[ https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963685#comment-15963685 ] Apache Spark commented on SPARK-18555: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/17601 > na.fill miss up original values in long integers > > > Key: SPARK-18555 > URL: https://issues.apache.org/jira/browse/SPARK-18555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Mahmoud Rawas >Assignee: Song Jun >Priority: Critical > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Manly the issue is clarified in the following example: > Given a Dataset: > scala> data.show > | a| b| > | 1| 2| > | -1| -2| > |9123146099426677101|9123146560113991650| > theoretically when we call na.fill(0) nothing should change, while the > current result is: > scala> data.na.fill(0).show > | a| b| > | 1| 2| > | -1| -2| > |9123146099426676736|9123146560113991680| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963510#comment-15963510 ] Ruslan Dautkhanov commented on SPARK-12837: --- It might be a bug in broadcast join. Following Spark 2 query fails with Total size of serialized results of 128 tasks (1026.2 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) when we set {code} sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024) # 500 mb {code} {noformat} SELECT . . . m.year, m.quarter, t.individ, t.hh_key FROMmv_dev.mv_raw_all_20170314 m, disc_dv.tsp_dv_02122017 t where m.psn = t.person_seq_no limit 10 {noformat} when we drop down to 400Mb {code} sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024) # 400 mb {code} , this error does not show up. > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963510#comment-15963510 ] Ruslan Dautkhanov edited comment on SPARK-12837 at 4/10/17 9:29 PM: It might be a bug in broadcast join. Following Spark 2 query fails with {quote}Total size of serialized results of 128 tasks (1026.2 MB) is bigger than spark.driver.maxResultSize (1024.0 MB){quote} when we set {code} sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024) # 500 mb {code} {noformat} SELECT . . . m.year, m.quarter, t.individ, t.hh_key FROMmv_dev.mv_raw_all_20170314 m, disc_dv.tsp_dv_02122017 t where m.psn = t.person_seq_no limit 10 {noformat} when we drop down to 400Mb {code} sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024) # 400 mb {code} , this error does not show up. was (Author: tagar): It might be a bug in broadcast join. Following Spark 2 query fails with Total size of serialized results of 128 tasks (1026.2 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) when we set {code} sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024) # 500 mb {code} {noformat} SELECT . . . m.year, m.quarter, t.individ, t.hh_key FROMmv_dev.mv_raw_all_20170314 m, disc_dv.tsp_dv_02122017 t where m.psn = t.person_seq_no limit 10 {noformat} when we drop down to 400Mb {code} sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024) # 400 mb {code} , this error does not show up. > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
Miguel Pérez created SPARK-20286: Summary: dynamicAllocation.executorIdleTimeout is ignored after unpersist Key: SPARK-20286 URL: https://issues.apache.org/jira/browse/SPARK-20286 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1 Reporter: Miguel Pérez With dynamic allocation enabled, it seems that executors with cached data which are unpersisted are still being killed using the {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor with unpersisted data won't be released until the job ends. *How to reproduce* - Set different values for {{dynamicAllocation.executorIdleTimeout}} and {{dynamicAllocation.cachedExecutorIdleTimeout}} - Load a file into a RDD and persist it - Execute an action on the RDD (like a count) so some executors are activated. - When the action has finished, unpersist the RDD - The application UI removes correctly the persisted data from the *Storage* tab, but if you look in the *Executors* tab, you will find that the executors remain *active* until ({{dynamicAllocation.cachedExecutorIdleTimeout}} is reached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963592#comment-15963592 ] Marcelo Vanzin commented on SPARK-16742: bq. It authenticates the Mesos principal, and this principal is allowed to launch processes only as certain Linux users. It's up the cluster admin to setup this mapping appropriately. Ok, that sounds similar then. Basically, you *can* set up Mesos so that it can do this securely, which was my initial question. (Being able to set things up in an insecure way is not actually that interesting; I just wanted to make sure there *is* a way to set things up securely.) So, assuming that Mesos is configured properly, then it should be OK for Spark code to distribute user credentials. bq. I actually said a "user might not be kinit'd". They may, however, have access to the keytab. That sounds like you might need the current code that distributes keytabs and logs in the cluster to make even client mode work in this setup. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
[ https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17564: Assignee: (was: Apache Spark) > Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay > -- > > Key: SPARK-17564 > URL: https://issues.apache.org/jira/browse/SPARK-17564 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.1, 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > Could be related to [SPARK-10680] > This is the test and one fix would be to increase the timeouts from 1.2 > seconds to 5 seconds > {code} > // The timeout is relative to the LAST request sent, which is kinda weird, > but still. > // This test also makes sure the timeout works for Fetch requests as well > as RPCs. > @Test > public void furtherRequestsDelay() throws Exception { > final byte[] response = new byte[16]; > final StreamManager manager = new StreamManager() { > @Override > public ManagedBuffer getChunk(long streamId, int chunkIndex) { > Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS); > return new NioManagedBuffer(ByteBuffer.wrap(response)); > } > }; > RpcHandler handler = new RpcHandler() { > @Override > public void receive( > TransportClient client, > ByteBuffer message, > RpcResponseCallback callback) { > throw new UnsupportedOperationException(); > } > @Override > public StreamManager getStreamManager() { > return manager; > } > }; > TransportContext context = new TransportContext(conf, handler); > server = context.createServer(); > clientFactory = context.createClientFactory(); > TransportClient client = > clientFactory.createClient(TestUtils.getLocalHost(), server.getPort()); > // Send one request, which will eventually fail. > TestCallback callback0 = new TestCallback(); > client.fetchChunk(0, 0, callback0); > Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS); > // Send a second request before the first has failed. > TestCallback callback1 = new TestCallback(); > client.fetchChunk(0, 1, callback1); > Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS); > // not complete yet, but should complete soon > assertEquals(-1, callback0.successLength); > assertNull(callback0.failure); > callback0.latch.await(60, TimeUnit.SECONDS); > assertTrue(callback0.failure instanceof IOException); > // failed at same time as previous > assertTrue(callback1.failure instanceof IOException); // This is where we > fail because callback1.failure is null > } > {code} > If there are better suggestions for improving this test let's take them > onboard, I think using 5 sec timeout periods would be a place to start so > folks don't need to needlessly triage this failure. Will add a few prints and > report back -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
[ https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17564: Assignee: Apache Spark > Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay > -- > > Key: SPARK-17564 > URL: https://issues.apache.org/jira/browse/SPARK-17564 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.1, 2.1.0 >Reporter: Adam Roberts >Assignee: Apache Spark >Priority: Minor > > Could be related to [SPARK-10680] > This is the test and one fix would be to increase the timeouts from 1.2 > seconds to 5 seconds > {code} > // The timeout is relative to the LAST request sent, which is kinda weird, > but still. > // This test also makes sure the timeout works for Fetch requests as well > as RPCs. > @Test > public void furtherRequestsDelay() throws Exception { > final byte[] response = new byte[16]; > final StreamManager manager = new StreamManager() { > @Override > public ManagedBuffer getChunk(long streamId, int chunkIndex) { > Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS); > return new NioManagedBuffer(ByteBuffer.wrap(response)); > } > }; > RpcHandler handler = new RpcHandler() { > @Override > public void receive( > TransportClient client, > ByteBuffer message, > RpcResponseCallback callback) { > throw new UnsupportedOperationException(); > } > @Override > public StreamManager getStreamManager() { > return manager; > } > }; > TransportContext context = new TransportContext(conf, handler); > server = context.createServer(); > clientFactory = context.createClientFactory(); > TransportClient client = > clientFactory.createClient(TestUtils.getLocalHost(), server.getPort()); > // Send one request, which will eventually fail. > TestCallback callback0 = new TestCallback(); > client.fetchChunk(0, 0, callback0); > Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS); > // Send a second request before the first has failed. > TestCallback callback1 = new TestCallback(); > client.fetchChunk(0, 1, callback1); > Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS); > // not complete yet, but should complete soon > assertEquals(-1, callback0.successLength); > assertNull(callback0.failure); > callback0.latch.await(60, TimeUnit.SECONDS); > assertTrue(callback0.failure instanceof IOException); > // failed at same time as previous > assertTrue(callback1.failure instanceof IOException); // This is where we > fail because callback1.failure is null > } > {code} > If there are better suggestions for improving this test let's take them > onboard, I think using 5 sec timeout periods would be a place to start so > folks don't need to needlessly triage this failure. Will add a few prints and > report back -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18555) na.fill miss up original values in long integers
[ https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-18555: Fix Version/s: 2.1.1 2.0.3 Component/s: SQL > na.fill miss up original values in long integers > > > Key: SPARK-18555 > URL: https://issues.apache.org/jira/browse/SPARK-18555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Mahmoud Rawas >Assignee: Song Jun >Priority: Critical > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Manly the issue is clarified in the following example: > Given a Dataset: > scala> data.show > | a| b| > | 1| 2| > | -1| -2| > |9123146099426677101|9123146560113991650| > theoretically when we call na.fill(0) nothing should change, while the > current result is: > scala> data.na.fill(0).show > | a| b| > | 1| 2| > | -1| -2| > |9123146099426676736|9123146560113991680| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963705#comment-15963705 ] Wenchen Fan commented on SPARK-12837: - [~Tagar] I think this may be because the size estimation was not accurate, can you try master branch? It should be fixed. > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963705#comment-15963705 ] Wenchen Fan edited comment on SPARK-12837 at 4/11/17 1:06 AM: -- [~Tagar] I think this may be because the size estimation was not accurate, can you try master branch? It should have been fixed. was (Author: cloud_fan): [~Tagar] I think this may be because the size estimation was not accurate, can you try master branch? It should be fixed. > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20274) support compatible array element type in encoder
[ https://issues.apache.org/jira/browse/SPARK-20274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962507#comment-15962507 ] Apache Spark commented on SPARK-20274: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17587 > support compatible array element type in encoder > > > Key: SPARK-20274 > URL: https://issues.apache.org/jira/browse/SPARK-20274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20274) support compatible array element type in encoder
[ https://issues.apache.org/jira/browse/SPARK-20274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20274: Assignee: Wenchen Fan (was: Apache Spark) > support compatible array element type in encoder > > > Key: SPARK-20274 > URL: https://issues.apache.org/jira/browse/SPARK-20274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20274) support compatible array element type in encoder
[ https://issues.apache.org/jira/browse/SPARK-20274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20274: Assignee: Apache Spark (was: Wenchen Fan) > support compatible array element type in encoder > > > Key: SPARK-20274 > URL: https://issues.apache.org/jira/browse/SPARK-20274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20274) support compatible array element type in encoder
Wenchen Fan created SPARK-20274: --- Summary: support compatible array element type in encoder Key: SPARK-20274 URL: https://issues.apache.org/jira/browse/SPARK-20274 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps
Saisai Shao created SPARK-20275: --- Summary: HistoryServer page shows incorrect complete date of inprogress apps Key: SPARK-20275 URL: https://issues.apache.org/jira/browse/SPARK-20275 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Saisai Shao Priority: Minor The HistoryServer's incomplete page shows in-progress application's completed date as {{1969-12-31 23:59:59}}, which is not meaningful and could be improved. So instead of showing this date, here proposed to not display this column since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20273: Assignee: Apache Spark (was: Xiao Li) > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters could be pushed down to the join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, we block users to add > non-deterministics conditions by the analyzer (For details, see the PR > https://github.com/apache/spark/pull/7535). > We should not push down non-deterministic conditions; otherwise, we should > allow users to do it by explicitly initialize the non-deterministic > expressions -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20273: Assignee: Xiao Li (was: Apache Spark) > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters could be pushed down to the join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, we block users to add > non-deterministics conditions by the analyzer (For details, see the PR > https://github.com/apache/spark/pull/7535). > We should not push down non-deterministic conditions; otherwise, we should > allow users to do it by explicitly initialize the non-deterministic > expressions -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962469#comment-15962469 ] Apache Spark commented on SPARK-20273: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17585 > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters could be pushed down to the join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, we block users to add > non-deterministics conditions by the analyzer (For details, see the PR > https://github.com/apache/spark/pull/7535). > We should not push down non-deterministic conditions; otherwise, we should > allow users to do it by explicitly initialize the non-deterministic > expressions -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
[ https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Mead reopened SPARK-20252: I'm Not sure how this explains how it works the first (and every) time if the spark context is not changed? There must be a discrepancy in the way that DSE creates the spark context the first time through and the way I create it after sc.stop? > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > --- > > Key: SPARK-20252 > URL: https://issues.apache.org/jira/browse/SPARK-20252 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.3 > Environment: Datastax DSE dual node SPARK cluster > [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native > protocol v4] >Reporter: Peter Mead > > After starting a spark shell using DSE -u -p x spark > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable("hcl","videos_by_actor") > vids: > com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] > = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15 > scala> vids.count > res0: Long = 114961 > Works OK!! > BUT if the spark context is stopped and recreated THEN: > scala> sc.stop() > scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, > org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import org.apache.spark.SparkConf > scala> :paste > // Entering paste mode (ctrl-D to finish) > val conf = new SparkConf(true) > .set("spark.cassandra.connection.host", "redacted") > .set("spark.cassandra.auth.username", "redacted") > .set("spark.cassandra.auth.password", "redacted") > // Exiting paste mode, now interpreting. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342 > scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf) > sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8 > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> vids.count > [Stage 0:> (0 + 2) / > 2]WARN 2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: > Lost task 0.0 in stage 0.0 (TID 0, cassandra183): > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > FAILS!! > I have been unable to get this to work from a remote SPARK shell! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20249) Add summary for LinearSVCModel
[ https://issues.apache.org/jira/browse/SPARK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-20249: --- Component/s: PySpark > Add summary for LinearSVCModel > -- > > Key: SPARK-20249 > URL: https://issues.apache.org/jira/browse/SPARK-20249 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Priority: Minor > > I'd like to get the ObjectiveHistory of each iteration. So I want to add > summary for LinearSVCModel as LogisticRegression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps
[ https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-20275: Description: The HistoryServer's incomplete page shows in-progress application's completed date as {{1969-12-31 23:59:59}}, which is not meaningful and could be improved. !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png! So instead of showing this date, here proposed to not display this column since it is not required for in-progress applications. was: The HistoryServer's incomplete page shows in-progress application's completed date as {{1969-12-31 23:59:59}}, which is not meaningful and could be improved. So instead of showing this date, here proposed to not display this column since it is not required for in-progress applications. > HistoryServer page shows incorrect complete date of inprogress apps > --- > > Key: SPARK-20275 > URL: https://issues.apache.org/jira/browse/SPARK-20275 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > Attachments: screenshot-1.png > > > The HistoryServer's incomplete page shows in-progress application's completed > date as {{1969-12-31 23:59:59}}, which is not meaningful and could be > improved. > !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png! > So instead of showing this date, here proposed to not display this column > since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps
[ https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-20275: Attachment: screenshot-1.png > HistoryServer page shows incorrect complete date of inprogress apps > --- > > Key: SPARK-20275 > URL: https://issues.apache.org/jira/browse/SPARK-20275 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > Attachments: screenshot-1.png > > > The HistoryServer's incomplete page shows in-progress application's completed > date as {{1969-12-31 23:59:59}}, which is not meaningful and could be > improved. > So instead of showing this date, here proposed to not display this column > since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20249) Add summary for LinearSVCModel
[ https://issues.apache.org/jira/browse/SPARK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962481#comment-15962481 ] Apache Spark commented on SPARK-20249: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/17586 > Add summary for LinearSVCModel > -- > > Key: SPARK-20249 > URL: https://issues.apache.org/jira/browse/SPARK-20249 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Priority: Minor > > I'd like to get the ObjectiveHistory of each iteration. So I want to add > summary for LinearSVCModel as LogisticRegression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20273: Affects Version/s: 2.0.2 > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters could be pushed down to the join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, we block users to add > non-deterministics conditions by the analyzer (For details, see the PR > https://github.com/apache/spark/pull/7535). > We should not push down non-deterministic conditions; otherwise, we should > allow users to do it by explicitly initialize the non-deterministic > expressions -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20249) Add summary for LinearSVCModel
[ https://issues.apache.org/jira/browse/SPARK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20249: Assignee: (was: Apache Spark) > Add summary for LinearSVCModel > -- > > Key: SPARK-20249 > URL: https://issues.apache.org/jira/browse/SPARK-20249 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Priority: Minor > > I'd like to get the ObjectiveHistory of each iteration. So I want to add > summary for LinearSVCModel as LogisticRegression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20249) Add summary for LinearSVCModel
[ https://issues.apache.org/jira/browse/SPARK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20249: Assignee: Apache Spark > Add summary for LinearSVCModel > -- > > Key: SPARK-20249 > URL: https://issues.apache.org/jira/browse/SPARK-20249 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > I'd like to get the ObjectiveHistory of each iteration. So I want to add > summary for LinearSVCModel as LogisticRegression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20269) add java class 'JavaWordCountProducer' to 'provide java word count producer'.
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20269. --- Resolution: Won't Fix > add java class 'JavaWordCountProducer' to 'provide java word count producer'. > - > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > 1.run example of streaming kafka,currently missing java word count > producer,not conducive to java developers to learn and test. > add a class JavaKafkaWordCountProducer. > 2.run example of JavaKafkaWordCount.I find no java word count producer. > run example of KafkaWordCount.I find have scala word count producer. > I think we should provide the corresponding example code to facilitate java > developers to learn and test. > 3.My project team develops spark applications,basically with java statements > and java API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20278) Disable 'multiple_dots_linter; lint rule that is against project's code style
Hyukjin Kwon created SPARK-20278: Summary: Disable 'multiple_dots_linter; lint rule that is against project's code style Key: SPARK-20278 URL: https://issues.apache.org/jira/browse/SPARK-20278 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Priority: Minor Currently, multi-dot separated variables in R is not allowed. For example, {code} setMethod("from_json", signature(x = "Column", schema = "structType"), - function(x, schema, asJsonArray = FALSE, ...) { + function(x, schema, as.json.array = FALSE, ...) { if (asJsonArray) { jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", "createArrayType", {code} produces an error as below: {code} R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'. function(x, schema, as.json.array = FALSE, ...) { ^ {code} This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says {quote} The preferred form for variable names is all lower case letters and words separated with dots {quote} This looks because lintr https://github.com/jimhester/lintr follows http://r-pkgs.had.co.nz/style.html as written in the README.md. This guide line seems against the rule. Per SPARK-6813, we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43 Also, we have no limit on function name. This rule also looks affecting to the name of functions as written in the README.md. {quote} multiple_dots_linter: check that function and variable names are separated by _ rather than .. {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps
[ https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20275: Assignee: (was: Apache Spark) > HistoryServer page shows incorrect complete date of inprogress apps > --- > > Key: SPARK-20275 > URL: https://issues.apache.org/jira/browse/SPARK-20275 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > Attachments: screenshot-1.png > > > The HistoryServer's incomplete page shows in-progress application's completed > date as {{1969-12-31 23:59:59}}, which is not meaningful and could be > improved. > !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png! > So instead of showing this date, here proposed to not display this column > since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps
[ https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20275: Assignee: Apache Spark > HistoryServer page shows incorrect complete date of inprogress apps > --- > > Key: SPARK-20275 > URL: https://issues.apache.org/jira/browse/SPARK-20275 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > Attachments: screenshot-1.png > > > The HistoryServer's incomplete page shows in-progress application's completed > date as {{1969-12-31 23:59:59}}, which is not meaningful and could be > improved. > !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png! > So instead of showing this date, here proposed to not display this column > since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps
[ https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962566#comment-15962566 ] Apache Spark commented on SPARK-20275: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/17588 > HistoryServer page shows incorrect complete date of inprogress apps > --- > > Key: SPARK-20275 > URL: https://issues.apache.org/jira/browse/SPARK-20275 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > Attachments: screenshot-1.png > > > The HistoryServer's incomplete page shows in-progress application's completed > date as {{1969-12-31 23:59:59}}, which is not meaningful and could be > improved. > !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png! > So instead of showing this date, here proposed to not display this column > since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20272) about graph shortestPath algorithm question
[ https://issues.apache.org/jira/browse/SPARK-20272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20272. --- Resolution: Invalid Questions should go to u...@spark.apache.org > about graph shortestPath algorithm question > --- > > Key: SPARK-20272 > URL: https://issues.apache.org/jira/browse/SPARK-20272 > Project: Spark > Issue Type: Question > Components: GraphX >Affects Versions: 2.1.0 >Reporter: huangjunjun > > we all know that shortestPath algorithm should be to comput the distance > between source vertex id and destination vertex id.In fact, the shortestPath > algorithm in graphX is computting the least passed vertex number from source > to destination. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20276) ScheduleExecutorService control sleep interval
darion yaphet created SPARK-20276: - Summary: ScheduleExecutorService control sleep interval Key: SPARK-20276 URL: https://issues.apache.org/jira/browse/SPARK-20276 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: darion yaphet Priority: Minor The SessionManager startup timeout checker periodicity . We can using ScheduleExecutorService to control time interval and to replacement Thread.sleep . It's seem seay and elegant . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20277) Allow Spark on YARN to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-20277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962595#comment-15962595 ] Sean Owen commented on SPARK-20277: --- What version of YARN is that going to require? it may be a non-starter for the near future if it requires 2.9 or 3.0 or something. > Allow Spark on YARN to be launched with Docker > -- > > Key: SPARK-20277 > URL: https://issues.apache.org/jira/browse/SPARK-20277 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 1.6.0, 2.0.0 >Reporter: Zhankun Tang > > Currently YARN is going to support Docker(YARN-3611). > We want to empower Spark to support launching Executors via a Docker image > that resolving the user's dependencies. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20276) ScheduleExecutorService control sleep interval
[ https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962630#comment-15962630 ] darion yaphet commented on SPARK-20276: --- thanks for you response :) *org.apache.hive.service.cli.session.SessionManager* is seems copy from hive source code . I will close this issue . > ScheduleExecutorService control sleep interval > --- > > Key: SPARK-20276 > URL: https://issues.apache.org/jira/browse/SPARK-20276 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: darion yaphet >Priority: Minor > > The SessionManager startup timeout checker periodicity . We can using > ScheduleExecutorService to control time interval and to replacement > Thread.sleep . It's seem seay and elegant . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
[ https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962636#comment-15962636 ] Sean Owen commented on SPARK-20252: --- It's likely related to other spark-shell + case class issues, whether exactly the same or not. Those are really issues with the Scala shell. Stopping and starting contexts isn't supported. If you have a lead on a reliable fix, propose it, but otherwise this is why I closed this. Generally, don't reopen issues without new info. > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > --- > > Key: SPARK-20252 > URL: https://issues.apache.org/jira/browse/SPARK-20252 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.3 > Environment: Datastax DSE dual node SPARK cluster > [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native > protocol v4] >Reporter: Peter Mead > > After starting a spark shell using DSE -u -p x spark > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable("hcl","videos_by_actor") > vids: > com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] > = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15 > scala> vids.count > res0: Long = 114961 > Works OK!! > BUT if the spark context is stopped and recreated THEN: > scala> sc.stop() > scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, > org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import org.apache.spark.SparkConf > scala> :paste > // Entering paste mode (ctrl-D to finish) > val conf = new SparkConf(true) > .set("spark.cassandra.connection.host", "redacted") > .set("spark.cassandra.auth.username", "redacted") > .set("spark.cassandra.auth.password", "redacted") > // Exiting paste mode, now interpreting. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342 > scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf) > sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8 > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> vids.count > [Stage 0:> (0 + 2) / > 2]WARN 2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: > Lost task 0.0 in stage 0.0 (TID 0, cassandra183): > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > FAILS!! > I have been unable to get this to work from a remote SPARK shell! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20276) ScheduleExecutorService control sleep interval
[ https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962585#comment-15962585 ] Sean Owen commented on SPARK-20276: --- Is this code cloned from Hive? if so it might not be worth customizing it. Otherwise, there's already a pool executing this thread. Are you proposing to make it a ScheduleExecutorService? > ScheduleExecutorService control sleep interval > --- > > Key: SPARK-20276 > URL: https://issues.apache.org/jira/browse/SPARK-20276 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: darion yaphet >Priority: Minor > > The SessionManager startup timeout checker periodicity . We can using > ScheduleExecutorService to control time interval and to replacement > Thread.sleep . It's seem seay and elegant . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20276) ScheduleExecutorService control sleep interval
[ https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet closed SPARK-20276. - Resolution: Won't Fix > ScheduleExecutorService control sleep interval > --- > > Key: SPARK-20276 > URL: https://issues.apache.org/jira/browse/SPARK-20276 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: darion yaphet >Priority: Minor > > The SessionManager startup timeout checker periodicity . We can using > ScheduleExecutorService to control time interval and to replacement > Thread.sleep . It's seem seay and elegant . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20278) Disable 'multiple_dots_linter; lint rule that is against project's code style
[ https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20278: - Description: Currently, multi-dot separated variables in R is not allowed. For example, {code} setMethod("from_json", signature(x = "Column", schema = "structType"), - function(x, schema, asJsonArray = FALSE, ...) { + function(x, schema, as.json.array = FALSE, ...) { if (asJsonArray) { jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", "createArrayType", {code} produces an error as below: {code} R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'. function(x, schema, as.json.array = FALSE, ...) { ^ {code} This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says {quote} The preferred form for variable names is all lower case letters and words separated with dots {quote} This looks because lintr https://github.com/jimhester/lintr follows http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases seems not following Google's one. Per SPARK-6813, we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43 Also, we have no limit on function name. This rule also looks affecting to the name of functions as written in the README.md. {quote} multiple_dots_linter: check that function and variable names are separated by _ rather than .. {quote} was: Currently, multi-dot separated variables in R is not allowed. For example, {code} setMethod("from_json", signature(x = "Column", schema = "structType"), - function(x, schema, asJsonArray = FALSE, ...) { + function(x, schema, as.json.array = FALSE, ...) { if (asJsonArray) { jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", "createArrayType", {code} produces an error as below: {code} R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'. function(x, schema, as.json.array = FALSE, ...) { ^ {code} This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says {quote} The preferred form for variable names is all lower case letters and words separated with dots {quote} This looks because lintr https://github.com/jimhester/lintr follows http://r-pkgs.had.co.nz/style.html as written in the README.md. This guide line seems against the rule. Per SPARK-6813, we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43 Also, we have no limit on function name. This rule also looks affecting to the name of functions as written in the README.md. {quote} multiple_dots_linter: check that function and variable names are separated by _ rather than .. {quote} > Disable 'multiple_dots_linter; lint rule that is against project's code style > - > > Key: SPARK-20278 > URL: https://issues.apache.org/jira/browse/SPARK-20278 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, multi-dot separated variables in R is not allowed. For example, > {code} > setMethod("from_json", signature(x = "Column", schema = "structType"), > - function(x, schema, asJsonArray = FALSE, ...) { > + function(x, schema, as.json.array = FALSE, ...) { > if (asJsonArray) { >jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", > "createArrayType", > {code} > produces an error as below: > {code} > R/functions.R:2462:31: style: Words within variable and function names should > be separated by '_' rather than '.'. > function(x, schema, as.json.array = FALSE, ...) { > ^ > {code} > This seems against https://google.github.io/styleguide/Rguide.xml#identifiers > which says > {quote} > The preferred form for variable names is all lower case letters and words > separated with dots > {quote} > This looks because lintr https://github.com/jimhester/lintr follows > http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases > seems not following Google's one. > Per SPARK-6813, we follow Google's R Style Guide with few exceptions >
[jira] [Updated] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style
[ https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20278: - Summary: Disable 'multiple_dots_linter' lint rule that is against project's code style (was: Disable 'multiple_dots_linter; lint rule that is against project's code style) > Disable 'multiple_dots_linter' lint rule that is against project's code style > - > > Key: SPARK-20278 > URL: https://issues.apache.org/jira/browse/SPARK-20278 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, multi-dot separated variables in R is not allowed. For example, > {code} > setMethod("from_json", signature(x = "Column", schema = "structType"), > - function(x, schema, asJsonArray = FALSE, ...) { > + function(x, schema, as.json.array = FALSE, ...) { > if (asJsonArray) { >jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", > "createArrayType", > {code} > produces an error as below: > {code} > R/functions.R:2462:31: style: Words within variable and function names should > be separated by '_' rather than '.'. > function(x, schema, as.json.array = FALSE, ...) { > ^ > {code} > This seems against https://google.github.io/styleguide/Rguide.xml#identifiers > which says > {quote} > The preferred form for variable names is all lower case letters and words > separated with dots > {quote} > This looks because lintr https://github.com/jimhester/lintr follows > http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases > seems not following Google's one. > Per SPARK-6813, we follow Google's R Style Guide with few exceptions > https://google.github.io/styleguide/Rguide.xml. This is also merged into > Spark's website - https://github.com/apache/spark-website/pull/43 > Also, we have no limit on function name. This rule also looks affecting to > the name of functions as written in the README.md. > {quote} > multiple_dots_linter: check that function and variable names are separated by > _ rather than .. > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style
[ https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962656#comment-15962656 ] Apache Spark commented on SPARK-20278: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17590 > Disable 'multiple_dots_linter' lint rule that is against project's code style > - > > Key: SPARK-20278 > URL: https://issues.apache.org/jira/browse/SPARK-20278 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, multi-dot separated variables in R is not allowed. For example, > {code} > setMethod("from_json", signature(x = "Column", schema = "structType"), > - function(x, schema, asJsonArray = FALSE, ...) { > + function(x, schema, as.json.array = FALSE, ...) { > if (asJsonArray) { >jschema <- callJStatic("org.apache.spark.sql.types.DataTypes", > "createArrayType", > {code} > produces an error as below: > {code} > R/functions.R:2462:31: style: Words within variable and function names should > be separated by '_' rather than '.'. > function(x, schema, as.json.array = FALSE, ...) { > ^ > {code} > This seems against https://google.github.io/styleguide/Rguide.xml#identifiers > which says > {quote} > The preferred form for variable names is all lower case letters and words > separated with dots > {quote} > This looks because lintr https://github.com/jimhester/lintr follows > http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases > seems not following Google's one. > Per SPARK-6813, we follow Google's R Style Guide with few exceptions > https://google.github.io/styleguide/Rguide.xml. This is also merged into > Spark's website - https://github.com/apache/spark-website/pull/43 > Also, we have no limit on function name. This rule also looks affecting to > the name of functions as written in the README.md. > {quote} > multiple_dots_linter: check that function and variable names are separated by > _ rather than .. > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200' in the page of 'jobs' or'stages'.
[ https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20279: --- Summary: In web ui,'Only showing 200' should be changed to 'only showing last 200' in the page of 'jobs' or'stages'. (was: In web ui,'Only showing 200' should be changed to 'only showing last 200'.) > In web ui,'Only showing 200' should be changed to 'only showing last 200' in > the page of 'jobs' or'stages'. > --- > > Key: SPARK-20279 > URL: https://issues.apache.org/jira/browse/SPARK-20279 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.
[ https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20279: Assignee: (was: Apache Spark) > In web ui,'Only showing 200' should be changed to 'only showing last 200'. > -- > > Key: SPARK-20279 > URL: https://issues.apache.org/jira/browse/SPARK-20279 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.
[ https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962763#comment-15962763 ] Apache Spark commented on SPARK-20279: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17593 > In web ui,'Only showing 200' should be changed to 'only showing last 200'. > -- > > Key: SPARK-20279 > URL: https://issues.apache.org/jira/browse/SPARK-20279 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.
[ https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20279: Assignee: Apache Spark > In web ui,'Only showing 200' should be changed to 'only showing last 200'. > -- > > Key: SPARK-20279 > URL: https://issues.apache.org/jira/browse/SPARK-20279 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962731#comment-15962731 ] Wenchen Fan commented on SPARK-12837: - [~jbherman] I tried and debugged your example, the actual task result is around 1kb, when the number of shuffle partitions is 1k, the total result size will be around 1mb. I'm trying to reduce the overhead of accumulator, but you have to increase the `spark.driver.maxResultSize` anyway. > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > Executing a sql statement with a large number of partitions requires a high > memory space for the driver even there are no requests to collect data back > to the driver. > Here are steps to re-produce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20280) SharedInMemoryCache Weigher integer overflow
[ https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Raducanu updated SPARK-20280: Description: in FileStatusCache.scala: {code} .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt }}) {code} Weigher.weigh returns Int but the size of an Array[FileStatus] could be bigger than Int.maxValue. Then, a negative value is returned, leading to this exception: {code} * [info] java.lang.IllegalStateException: Weights must be non-negative * [info] at com.google.common.base.Preconditions.checkState(Preconditions.java:149) * [info] at com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223) * [info] at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944) * [info] at com.google.common.cache.LocalCache.put(LocalCache.java:4212) * [info] at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) * [info] at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131) {code} was: {code} .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt }}) {code} Weigher.weigh returns Int but the size of an Array[FileStatus] could be bigger than Int.maxValue. Then, a negative value is returned, leading to this exception: {code} * [info] java.lang.IllegalStateException: Weights must be non-negative * [info] at com.google.common.base.Preconditions.checkState(Preconditions.java:149) * [info] at com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223) * [info] at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944) * [info] at com.google.common.cache.LocalCache.put(LocalCache.java:4212) * [info] at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) * [info] at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131) {code} > SharedInMemoryCache Weigher integer overflow > > > Key: SPARK-20280 > URL: https://issues.apache.org/jira/browse/SPARK-20280 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Bogdan Raducanu > > in FileStatusCache.scala: > {code} > .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { > override def weigh(key: (ClientId, Path), value: Array[FileStatus]): > Int = { > (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt > }}) > {code} > Weigher.weigh returns Int but the size of an Array[FileStatus] could be > bigger than Int.maxValue. Then, a negative value is returned, leading to this > exception: > {code} > * [info] java.lang.IllegalStateException: Weights must be non-negative > * [info] at > com.google.common.base.Preconditions.checkState(Preconditions.java:149) > * [info] at > com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223) > * [info] at > com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944) > * [info] at com.google.common.cache.LocalCache.put(LocalCache.java:4212) > * [info] at > com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > * [info] at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20243: Assignee: Apache Spark > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Apache Spark > > Introduced by SPARK-19946. > DebugFilesystem.assertNoOpenStreams gets the size of the openStreams > ConcurrentHashMap and then later, if the size was > 0, accesses the first > element in openStreams.values. But, the ConcurrentHashMap might be cleared by > another thread between getting its size and accessing it, resulting in an > exception when trying to call .head on an empty collection. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org