[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352268#comment-14352268 ] Patrick Wendell commented on SPARK-5134: Hey [~rdub] [~srowen], As part of the 1.3 release cycle I did some more forensics on the actual artifacts we publish. It turns out that because of the changes made for Scala 2.11 with the way our publishing works, we've actually been publishing poms that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom Hadoop version is decoupled now from the default one in the build itself, because of our use of the effective pom plugin. https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119 I'm actually a bit bummed that we (unintentionally) made this change in 1.2 because I do fear it likely screwed things up for some users. But on the plus side, since we no decouple the publishing from the default version in the pom, I don't see a big issue with updating the POM. So I withdraw my objection on the PR. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352153#comment-14352153 ] Yin Huai commented on SPARK-6208: - [~pwendell] Oh, I see. I was trying to increase the executor memory so I can cache a larger RDD. Since --conf spark.executor.memory works, should we resolve it as not a problem? executor-memory does not work when using local cluster -- Key: SPARK-6208 URL: https://issues.apache.org/jira/browse/SPARK-6208 Project: Spark Issue Type: New Feature Components: Spark Submit Reporter: Yin Huai Priority: Minor It seems the executor memory set with a local cluster is not correctly applied (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). Also, totalExecutorCores seems to have the same issue (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
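For reference, a minimal PySpark sketch of the --conf style workaround mentioned in the comment above: setting the spark.executor.memory property directly instead of relying on the --executor-memory flag. The local-cluster master string and the sizes here are illustrative assumptions, not values taken from this report, and whether the property actually reaches the executors is exactly the behavior this issue discusses.
{code}
from pyspark import SparkConf, SparkContext

# Hypothetical sizing: 2 workers, 1 core each, 2048 MB per worker process.
conf = (SparkConf()
        .setMaster("local-cluster[2,1,2048]")
        .setAppName("executor-memory-workaround")
        .set("spark.executor.memory", "2g"))  # property route, equivalent to passing --conf to spark-submit

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).count())    # run any small job; check the executor memory in the web UI
sc.stop()
{code}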
[jira] [Created] (SPARK-6216) Check Python version in worker before run PySpark job
Davies Liu created SPARK-6216: - Summary: Check Python version in worker before run PySpark job Key: SPARK-6216 URL: https://issues.apache.org/jira/browse/SPARK-6216 Project: Spark Issue Type: Improvement Reporter: Davies Liu PySpark can only run when the driver and the workers use the same major Python version (both 2.6 or both 2.7); it will cause random errors if the driver has 2.7 and the workers have 2.6 (or vice versa). For example: {code} davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) SparkContext available as sc, SQLContext available as sqlCtx. sc.textFile('LICENSE').map(lambda l: l.split()).count() org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/davies/work/spark/python/pyspark/worker.py, line 101, in main process() File /Users/davies/work/spark/python/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 281, in func return f(iterator) File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File stdin, line 1, in lambda TypeError: 'bool' object is not callable at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:177) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
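A minimal sketch of the kind of guard this issue proposes, assuming (hypothetically) that the driver ships its own major.minor version string to the worker before any task code runs; the function and variable names are illustrative, not the actual names used in pyspark/worker.py.
{code}
import sys

def check_python_version(driver_version):
    """Fail fast with a clear error instead of a confusing mid-task TypeError.

    driver_version is assumed to be a string such as "2.7" sent over by the driver.
    """
    worker_version = "%d.%d" % sys.version_info[:2]
    if worker_version != driver_version:
        raise RuntimeError(
            "Python in worker has different version %s than that in driver %s; "
            "PySpark cannot run with different Python versions in driver and workers"
            % (worker_version, driver_version))

# Hypothetical usage at worker start-up, before deserializing the task:
# check_python_version(version_string_read_from_driver)
{code}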
[jira] [Assigned] (SPARK-6216) Check Python version in worker before run PySpark job
[ https://issues.apache.org/jira/browse/SPARK-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6216: - Assignee: Davies Liu Check Python version in worker before run PySpark job - Key: SPARK-6216 URL: https://issues.apache.org/jira/browse/SPARK-6216 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu PySpark can only run with the same major version both in driver and worker ( both of the are 2.6 or 2.7), it will cause random error if it have 2.7 in driver or 2.6 in worker (or vice). For example: {code} davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) SparkContext available as sc, SQLContext available as sqlCtx. sc.textFile('LICENSE').map(lambda l: l.split()).count() org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/davies/work/spark/python/pyspark/worker.py, line 101, in main process() File /Users/davies/work/spark/python/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 281, in func return f(iterator) File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File stdin, line 1, in lambda TypeError: 'bool' object is not callable at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:177) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6215) Shorten apply and update funcs in GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351987#comment-14351987 ] Apache Spark commented on SPARK-6215: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4940 Shorten apply and update funcs in GenerateProjection Key: SPARK-6215 URL: https://issues.apache.org/jira/browse/SPARK-6215 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Some code in GenerateProjection looks redundant and can be shortened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6215) Shorten apply and update funcs in GenerateProjection
Liang-Chi Hsieh created SPARK-6215: -- Summary: Shorten apply and update funcs in GenerateProjection Key: SPARK-6215 URL: https://issues.apache.org/jira/browse/SPARK-6215 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Some code in GenerateProjection looks redundant and can be shortened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352306#comment-14352306 ] Kostas Sakellis commented on SPARK-1239: How many reduce-side tasks do you have? Can you please attach your logs that show the OOM errors? Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352317#comment-14352317 ] Shivaram Venkataraman commented on SPARK-5134: -- Yeah, so this did change in 1.2, and I think I mentioned it to Patrick when it affected a couple of other projects of mine. The main problem there was that even if you have an explicit Hadoop 1 dependency in your project, SBT picks up the highest version required while building an assembly jar for the project. Thus, with Spark linked against Hadoop 2.2, one would need an exclusion rule to use Hadoop 1. It might be good to add this to the docs or to some of the example Quick Start documentation we have. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352356#comment-14352356 ] Patrick Wendell commented on SPARK-1239: It would be helpful if any users who have observed this could comment on the JIRA and give workload information. This has been more on the back burner since we've heard few reports of it on the mailing list, etc... Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352396#comment-14352396 ] liyunzhang_intel commented on SPARK-5682: - Hi [~srowen]: Encrypted shuffle makes the shuffle process safer. I think it is necessary in Spark. The previous design reused the Hadoop encrypted shuffle algorithm to enable Spark encrypted shuffle. That design has a big problem: it imports many crypto classes, like CryptoInputStream and CryptoOutputStream, which are marked private in Hadoop. Now my teammates and I have decided to write the crypto classes in Spark, so there is no dependence on Hadoop 2.6. We are not directly copying Hadoop code into Spark; we only reuse the crypto algorithms (JCE/AES-NI) that Hadoop uses. Maybe I need to rename the JIRA from Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle to Add encrypted shuffle in spark. Any advice is welcome. Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6 and makes the shuffled data safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6217) insertInto doesn't work in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Cloud updated SPARK-6217: - Summary: insertInto doesn't work in PySpark (was: insertInto doesn't work) insertInto doesn't work in PySpark -- Key: SPARK-6217 URL: https://issues.apache.org/jira/browse/SPARK-6217 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Mac OS X Yosemite 10.10.2 Python 2.7.9 Spark 1.3.0 Reporter: Charles Cloud The following code, running in an IPython shell throws an error: {code:none} In [1]: from pyspark import SparkContext, HiveContext In [2]: sc = SparkContext('local[*]', 'test') Spark assembly has been built with Hive, including Datanucleus jars on classpath In [3]: sql = HiveContext(sc) In [4]: import pandas as pd In [5]: df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1, 2, 3], 'c': list('abc')}) In [6]: df2 = pd.DataFrame({'a': [2.0, 3.0, 4.0], 'b': [4, 5, 6], 'c': list('def')}) In [7]: sdf = sql.createDataFrame(df) In [8]: sdf2 = sql.createDataFrame(df2) In [9]: sql.registerDataFrameAsTable(sdf, 'sdf') In [10]: sql.registerDataFrameAsTable(sdf2, 'sdf2') In [11]: sql.cacheTable('sdf') In [12]: sql.cacheTable('sdf2') In [13]: sdf2.insertInto('sdf') # throws an error {code} Here's the Java traceback: {code:none} Py4JJavaError: An error occurred while calling o270.insertInto. : java.lang.AssertionError: assertion failed: No plan for InsertIntoTable (LogicalRDD [a#0,b#1L,c#2], MapPartitionsRDD[13] at mapPartitions at SQLContext.scala:1167), Map(), false InMemoryRelation [a#6,b#7L,c#8], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [a#6,b#7L,c#8], MapPartitionsRDD[41] at mapPartitions at SQLContext.scala:1167), Some(sdf2) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1085) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1083) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.insertInto(DataFrame.scala:1134) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} I'd be ecstatic if this was my own fault, and I'm somehow using it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352268#comment-14352268 ] Patrick Wendell edited comment on SPARK-5134 at 3/8/15 11:27 PM: - Hey [~rdub] [~srowen], As part of the 1.3 release cycle I did some more forensics on the actual artifacts we publish. It turns out that because of the changes made for Scala 2.11 with the way our publishing works, we've actually been publishing poms that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom Hadoop version is decoupled now from the default one in the build itself, because of our use of the effective pom plugin. https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119 I'm actually a bit bummed that we (unintentionally) made this change in 1.2 because I do fear it likely screwed things up for some users. But on the plus side, since we now decouple the publishing from the default version in the pom, I don't see a big issue with updating the POM. So I withdraw my objection on the PR. was (Author: pwendell): Hey [~rdub] [~srowen], As part of the 1.3 release cycle I did some more forensics on the actual artifacts we publish. It turns out that because of the changes made for Scala 2.11 with the way our publishing works, we've actually been publishing poms that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom Hadoop version is decoupled now from the default one in the build itself, because of our use of the effective pom plugin. https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119 I'm actually a bit bummed that we (unintentionally) made this change in 1.2 because I do fear it likely screwed things up for some users. But on the plus side, since we no decouple the publishing from the default version in the pom, I don't see a big issue with updating the POM. So I withdraw my objection on the PR. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352341#comment-14352341 ] Patrick Wendell commented on SPARK-5134: [~shivaram] did it end up working alright if you just excluded Spark's Hadoop dependency? If so we can just document this. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352377#comment-14352377 ] Mridul Muralidharan commented on SPARK-1239: Hitting the Akka frame size for the MapOutputTracker is very easy since we fetch the whole output (m * r). While I can't get into the specifics of our jobs or share logs, it is easy to see this hitting 1G for 100k mappers and 50k reducers. If this is not being looked into currently, I can add it to my list of things to fix - but if there is already work being done, I don't want to duplicate it. Even something trivial like what was done for task results would suffice (if we don't want the additional overhead of per-reduce map output generation at the master). Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
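To see why those mapper/reducer counts blow past a 1 GB frame, here is a back-of-envelope estimate (my own numbers, with a deliberately optimistic one byte per entry; the real per-entry cost of the serialized statuses depends on encoding and compression):
{code}
# Map output statuses form an m x r matrix of sizes, one entry per (mapper, reducer) pair.
mappers = 100000
reducers = 50000
bytes_per_entry = 1   # optimistic: a single compressed-size byte per pair

total_bytes = mappers * reducers * bytes_per_entry
print(total_bytes / float(1024 ** 3))   # ~4.7 GiB, already far above a 1 GB Akka frame
{code}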
[jira] [Commented] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352417#comment-14352417 ] Jianshi Huang commented on SPARK-6154: -- I see. Here's my build flag: -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Phadoop-2.4 -Djava.version=1.7 -DskipTests BTW, when will Kafka and JDBC be supported in 2.11 build? Jianshi Build error with Scala 2.11 for v1.3.0-rc2 -- Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352416#comment-14352416 ] Joseph K. Bradley commented on SPARK-3066: -- It's similar, I believe, for ALS. The cosine similarity metric you get with the dot product for ALS is a distance metric, right? So finding the top K products to recommend to a given user is essentially the same as finding the K product feature vectors which are closest to the user's feature vector. This optimization could be used both for recommending for a single user and for recommendAll. I'm not sure how effective these approximate nearest neighbor methods are. My understanding is that they work reasonably well as long as the feature space is fairly low-dimensional, which should often be the case for ALS. My hope is that these approximate nearest neighbor data structures can reduce communication. The ones I've seen are based on feature space partitioning, which could potentially allow you to figure out a subset of product partitions to check for each user. Using level 3 BLAS might be better; I'm really not sure. It won't reduce communication, though. These two types of optimizations might be orthogonal, anyway. Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
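To make the broadcast-plus-BLAS idea from the description concrete, here is a small PySpark/NumPy sketch under my own assumptions (user factors as an RDD of (id, vector) pairs, product factors small enough to broadcast); it illustrates the proposal, not MLlib's actual implementation, and all names are hypothetical.
{code}
import numpy as np

def recommend_all(user_factors_rdd, product_factors, product_ids, sc, k=10):
    """Top-k products per user: broadcast one side, score each block of users with a
    single matrix-matrix product (level-3 BLAS under the hood), then keep the k best."""
    P = sc.broadcast(np.asarray(product_factors))        # shape: (num_products, rank)
    ids = sc.broadcast(np.asarray(product_ids))

    def score_partition(rows):
        rows = list(rows)
        if not rows:
            return
        user_ids = [u for u, _ in rows]
        U = np.vstack([f for _, f in rows])               # shape: (block_size, rank)
        scores = U.dot(P.value.T)                         # one BLAS-3 call per block
        top = np.argsort(-scores, axis=1)[:, :k]          # indices of the k largest scores per user
        for i, uid in enumerate(user_ids):
            yield uid, list(zip(ids.value[top[i]], scores[i, top[i]]))

    return user_factors_rdd.mapPartitions(score_partition)
{code}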
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni commented on SPARK-6192: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 2:45 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 2:45 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 3:02 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed computation platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6206) spark-ec2 script reporting SSL error?
[ https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352481#comment-14352481 ] Nicholas Chammas commented on SPARK-6206: - OK, let us know what you find, [~Joe6521]. In general, please try to validate your issue on the user list or on Stack Overflow before reporting it here, unless you are really sure you've found a problem with Spark (as opposed to your environment). spark-ec2 script reporting SSL error? - Key: SPARK-6206 URL: https://issues.apache.org/jira/browse/SPARK-6206 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Joe O I have been using the spark-ec2 script for several months with no problems. Recently, when executing a script to launch a cluster I got the following error: {code} [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib {code} Nothing launches, the script exits. I am not sure if something on machine changed, this is a problem with EC2's certs, or a problem with Python. It occurs 100% of the time, and has been occurring over at least the last two days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
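One quick way to do the environment check suggested above is to see whether a plain HTTPS request to an EC2 endpoint fails the same way outside of spark-ec2 and boto. This is a diagnostic sketch of my own, not part of the spark-ec2 script, and the endpoint URL is just one example region.
{code}
# Python 2 diagnostic: if this also fails with an SSL/x509 error, the problem is the
# local Python/OpenSSL certificate setup rather than the spark-ec2 script itself.
import urllib2

try:
    urllib2.urlopen("https://ec2.us-east-1.amazonaws.com/", timeout=10)
    print("TLS handshake OK")
except urllib2.HTTPError:
    print("TLS handshake OK (the server returned an HTTP error, which is fine for this test)")
except Exception as e:
    print("TLS failure outside spark-ec2: %r" % (e,))
{code}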
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. 
For example: {code} spark-ec2 --ec2-instance-option {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352489#comment-14352489 ] Nicholas Chammas commented on SPARK-6220: - cc [~joshrosen] and [~shivaram] for feedback. The immediate motivation for this is the work I'm doing on automating spark-perf runs. As part of an automated spark-perf run, I'd like to: * set {{instance_initiated_shutdown_behavior=terminate}} for the non-spot instances launched by spark-ec2 (i.e. the master), so that the cluster can self-terminate without needing outside input * set {{instance_profile_arn}} for the master so that spark-perf results can be uploaded to S3 without having to handle AWS user credentials, via use of IAM profiles Since my use case is specialized, I didn't think it was worth adding top-level options for these EC2 features. So I generalized the idea to support any EC2 option supported by boto. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. 
Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
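As a rough illustration of how such a generic pass-through could work on the spark-ec2 side, here is a small Python sketch (my own assumed parsing, not an actual spark-ec2 patch): each --ec2-instance-option value is split into a key=value pair, literals such as True or numbers are evaluated safely, and the resulting dict is merged into the keyword arguments of the boto launch call. The opts.* names in the usage comment are hypothetical.
{code}
import ast

def parse_extended_options(option_strings):
    """Turn ['ebs_optimized=True', 'min_count=1'] into {'ebs_optimized': True, 'min_count': 1}."""
    kwargs = {}
    for opt in option_strings or []:
        key, _, value = opt.partition("=")
        try:
            kwargs[key] = ast.literal_eval(value)   # booleans, numbers, quoted strings
        except (ValueError, SyntaxError):
            kwargs[key] = value                     # fall back to the raw string (e.g. 'terminate')
    return kwargs

# Hypothetical wiring inside spark-ec2 when launching on-demand instances:
# extra_kwargs = parse_extended_options(opts.ec2_instance_options)
# reservation = image.run(instance_type=opts.instance_type, ..., **extra_kwargs)
{code}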
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 3:30 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed computation platforms like spark but don't have any experience. I would like to take this gsoc project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed computation platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6183) Skip bad workers when re-launching executors
[ https://issues.apache.org/jira/browse/SPARK-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352492#comment-14352492 ] Peng Zhen commented on SPARK-6183: -- [~srowen] they are related, but not the same. SPARK-4609 works on re-scheduling tasks, and SPARK-6183 works on re-launching executors. @davies Skip bad workers when re-launching executors Key: SPARK-6183 URL: https://issues.apache.org/jira/browse/SPARK-6183 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Peng Zhen In standalone cluster, when an executor launch fails, the master should avoid re-launching it on the same worker. According to the current scheduling logic, the failed executor will be highly possible re-launched on the same worker, and finally cause the application removed from the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6183) Skip bad workers when re-launching executors
[ https://issues.apache.org/jira/browse/SPARK-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352492#comment-14352492 ] Peng Zhen edited comment on SPARK-6183 at 3/9/15 3:36 AM: -- [~srowen] they are related, but not the same. SPARK-4609 works on re-scheduling tasks, and SPARK-6183 works on re-launching executors. [~davies] was (Author: zhpengg): [~srowen] they are related, but not the same. SPARK-4609 works on re-scheduling tasks, and SPARK-6183 works on re-launching executors. @davies Skip bad workers when re-launching executors Key: SPARK-6183 URL: https://issues.apache.org/jira/browse/SPARK-6183 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Peng Zhen In standalone cluster, when an executor launch fails, the master should avoid re-launching it on the same worker. According to the current scheduling logic, the failed executor will be highly possible re-launched on the same worker, and finally cause the application removed from the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. 
Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352514#comment-14352514 ] Shivaram Venkataraman commented on SPARK-6220: -- Seems like a good idea and the syntax sounds good to me. Just curious: Are these the only two boto calls we use ? Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
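To make the proposal above concrete, here is a minimal sketch of how such pass-through options could be collected and forwarded on the spark-ec2 side. The {{--ec2-instance-option}} flag and the {{parse_extra_options}} helper are illustrative assumptions, not code that exists in {{spark_ec2.py}} today; the parsed pairs are simply merged into the {{run_instances}} keyword arguments (and the same dict could feed {{request_spot_instances}}).
{code}
# Hypothetical sketch: collect repeated --ec2-instance-option KEY=VALUE flags
# and forward them as keyword arguments to boto's run_instances call.
import ast
from optparse import OptionParser

def parse_extra_options(pairs):
    """Turn ['ebs_optimized=True', ...] into {'ebs_optimized': True, ...}."""
    extra = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        try:
            # Interpret True/False/numbers literally; fall back to the raw string.
            extra[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            extra[key] = value
    return extra

parser = OptionParser()
parser.add_option("--ec2-instance-option", action="append", default=[],
                  dest="ec2_instance_options", metavar="KEY=VALUE")
(options, args) = parser.parse_args()

extra_kwargs = parse_extra_options(options.ec2_instance_options)
# conn would be the existing boto.ec2.connection.EC2Connection in spark_ec2.py,
# and the extra kwargs get merged into the run_instances call it already makes:
# conn.run_instances(image_id, instance_type=..., **extra_kwargs)
{code}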
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. 
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352284#comment-14352284 ] Mridul Muralidharan commented on SPARK-1239: [~pwendell] Is there any update on this ? This is fairly commonly hitting us, and we are at 1Gig for framesize already now ... Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6217) insertInto doesn't work
Charles Cloud created SPARK-6217: Summary: insertInto doesn't work Key: SPARK-6217 URL: https://issues.apache.org/jira/browse/SPARK-6217 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Mac OS X Yosemite 10.10.2 Python 2.7.9 Spark 1.3.0 Reporter: Charles Cloud The following code, running in an IPython shell throws an error: {code:none} In [1]: from pyspark import SparkContext, HiveContext In [2]: sc = SparkContext('local[*]', 'test') Spark assembly has been built with Hive, including Datanucleus jars on classpath In [3]: sql = HiveContext(sc) In [4]: import pandas as pd In [5]: df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1, 2, 3], 'c': list('abc')}) In [6]: df2 = pd.DataFrame({'a': [2.0, 3.0, 4.0], 'b': [4, 5, 6], 'c': list('def')}) In [7]: sdf = sql.createDataFrame(df) In [8]: sdf2 = sql.createDataFrame(df2) In [9]: sql.registerDataFrameAsTable(sdf, 'sdf') In [10]: sql.registerDataFrameAsTable(sdf2, 'sdf2') In [11]: sql.cacheTable('sdf') In [12]: sql.cacheTable('sdf2') In [13]: sdf2.insertInto('sdf') # throws an error {code} Here's the Java traceback: {code:none} Py4JJavaError: An error occurred while calling o270.insertInto. : java.lang.AssertionError: assertion failed: No plan for InsertIntoTable (LogicalRDD [a#0,b#1L,c#2], MapPartitionsRDD[13] at mapPartitions at SQLContext.scala:1167), Map(), false InMemoryRelation [a#6,b#7L,c#8], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [a#6,b#7L,c#8], MapPartitionsRDD[41] at mapPartitions at SQLContext.scala:1167), Some(sdf2) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1085) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1083) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.insertInto(DataFrame.scala:1134) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} I'd be ecstatic if this was my own fault, and I'm somehow using it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
Nicholas Chammas created SPARK-6218: --- Summary: Upgrade spark-ec2 from optparse to argparse Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352331#comment-14352331 ] Nicholas Chammas commented on SPARK-6218: - [~shivaram], [~joshrosen]: What do you think? Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6218: Description: spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. was: spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
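As a rough illustration of the validation benefit described above, the argparse sketch below ties a checker to each parameter via {{type=}}. The option names and the {{valid_spot_price}} function are made up for the example and are not taken from {{spark_ec2.py}}.
{code}
# Hypothetical sketch: with argparse, each parameter carries its own validator,
# replacing ad-hoc checks scattered through the script.
import argparse

def valid_spot_price(value):
    price = float(value)  # a ValueError here is reported cleanly by argparse
    if price <= 0:
        raise argparse.ArgumentTypeError("spot price must be positive: %r" % value)
    return price

parser = argparse.ArgumentParser(prog="spark-ec2")
parser.add_argument("--slaves", type=int, default=1,
                    help="number of slaves to launch")
parser.add_argument("--spot-price", type=valid_spot_price, metavar="PRICE",
                    help="launch slaves as spot instances at this max price")

args = parser.parse_args(["--slaves", "4", "--spot-price", "0.50"])
print("slaves=%d, spot price=%.2f" % (args.slaves, args.spot_price))
{code}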
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352398#comment-14352398 ] Apache Spark commented on SPARK-6219: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/4941 Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
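One possible shape for such a check is sketched below; this illustrates the idea and is not the contents of the linked pull request. It walks the Python sources and byte-compiles each file, so syntax errors are reported without importing or executing any module. The simpler {{python -m compileall -q python/}} would catch the same class of errors.
{code}
# Sketch of a compile-only lint pass over the Python sources.
import os
import py_compile
import sys

def compile_check(root):
    failed = False
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                py_compile.compile(path, doraise=True)
            except py_compile.PyCompileError as e:
                print(e.msg)
                failed = True
    return failed

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "python"
    sys.exit(1 if compile_check(root) else 0)
{code}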
[jira] [Commented] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352535#comment-14352535 ] Apache Spark commented on SPARK-6209: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4944 ExecutorClassLoader can leak connections after failing to load classes from the REPL class server - Key: SPARK-6209 URL: https://issues.apache.org/jira/browse/SPARK-6209 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.3, 1.3.0, 1.1.2, 1.2.1, 1.4.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. Here is a simple reproduction: With {code} ./bin/spark-shell --master local-cluster[8,8,512] {code} run the following command: {code} sc.parallelize(1 to 1000, 1000).map { x = try { Class.forName(some.class.that.does.not.Exist) } catch { case e: Exception = // do nothing } x }.count() {code} This job will run 253 tasks, then will completely freeze without any errors or failed tasks. It looks like the driver has 253 threads blocked in socketRead0() calls: {code} [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc 253 759 14674 {code} e.g. {code} qtp1287429402-13 daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable [0x0001159bd000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) {code} Jstack on the executors shows blocking in loadClass / findClass, where a single thread is RUNNABLE and waiting to hear back from the driver and other executor threads are BLOCKED on object monitor synchronization at Class.forName0(). Remotely triggering a GC on a hanging executor allows the job to progress and complete more tasks before hanging again. If I repeatedly trigger GC on all of the executors, then the job runs to completion: {code} jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run {code} The culprit is a {{catch}} block that ignores all exceptions and performs no cleanup: https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 This bug has been present since Spark 1.0.0, but I suspect that we haven't seen it before because it's pretty hard to reproduce. 
Triggering this error requires a job with tasks that trigger ClassNotFoundExceptions yet are still able to run to completion. It also requires that executors are able to leak enough open connections to exhaust the class server's Jetty thread pool limit, which requires that there are a large number of tasks (253+) and either a large number of executors or a very low amount of GC pressure on those executors (since GC will cause the leaked connections to be closed). The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352546#comment-14352546 ] Manoj Kumar commented on SPARK-6192: [~Manglano] [~leckie-chn] Hi, I am actually not a mentor but a student whom this GSoC project is preassigned to by Xiangrui (since I've been working on the Spark codebase for about a couple of months right now) . This project idea was actually a result of brainstorming across different Pull Requests. I would suggest you have a look at different issues which would help you gain familiarity with the API and help to propose a project proposal. Hope that helps. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support latest Scala (2.11.6+)
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Summary: Support latest Scala (2.11.6+) (was: Support Scala 2.11.6+) Support latest Scala (2.11.6+) -- Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Issue Type: New Feature (was: Improvement) Support Scala 2.11.5+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Priority: Major (was: Minor) Support Scala 2.11.5+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.6+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Summary: Support Scala 2.11.6+ (was: Support Scala 2.11.5+) Support Scala 2.11.6+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352572#comment-14352572 ] Yin Huai edited comment on SPARK-5463 at 3/9/15 5:37 AM: - Seems [~liancheng]'s fix for SPARK-5451 has been released with Parquet [RC5|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc5/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L244]. was (Author: yhuai): Seems [~liancheng]'s fix has been released with Parquet [RC5|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc5/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L244]. Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352524#comment-14352524 ] Nicholas Chammas commented on SPARK-6220: - As far as places where we create instances, yes, those are the 2 calls we use. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352546#comment-14352546 ] Manoj Kumar edited comment on SPARK-6192 at 3/9/15 4:51 AM: [~Manglano] [~leckie-chn] Hi, I am actually not a mentor but a student whom this GSoC project is preassigned to by Xiangrui (since I've been working on the Spark codebase for about a couple of months right now) . This project idea was actually a result of brainstorming across different Pull Requests. I would suggest you have a look at different issues which would help you gain familiarity with Spark and help to propose a project proposal. Hope that helps. was (Author: mechcoder): [~Manglano] [~leckie-chn] Hi, I am actually not a mentor but a student whom this GSoC project is preassigned to by Xiangrui (since I've been working on the Spark codebase for about a couple of months right now) . This project idea was actually a result of brainstorming across different Pull Requests. I would suggest you have a look at different issues which would help you gain familiarity with the API and help to propose a project proposal. Hope that helps. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6193) Speed up how spark-ec2 searches for clusters
[ https://issues.apache.org/jira/browse/SPARK-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6193: - Assignee: Nicholas Chammas Speed up how spark-ec2 searches for clusters Key: SPARK-6193 URL: https://issues.apache.org/jira/browse/SPARK-6193 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.4.0 {{spark-ec2}} currently pulls down [info for all instances|https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620] and searches locally for the target cluster. Instead, it should push those filters up when querying EC2. For AWS accounts with hundreds of active instances, there is a difference of many seconds between the two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6193) Speed up how spark-ec2 searches for clusters
[ https://issues.apache.org/jira/browse/SPARK-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6193. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4922 [https://github.com/apache/spark/pull/4922] Speed up how spark-ec2 searches for clusters Key: SPARK-6193 URL: https://issues.apache.org/jira/browse/SPARK-6193 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.4.0 {{spark-ec2}} currently pulls down [info for all instances|https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620] and searches locally for the target cluster. Instead, it should push those filters up when querying EC2. For AWS accounts with hundreds of active instances, there is a difference of many seconds between the two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
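For context, "pushing those filters up" means letting EC2 do the matching server-side instead of fetching every instance. A rough boto sketch of that idea follows, assuming the cluster is identified by its {{cluster_name-master}} and {{cluster_name-slaves}} security groups as spark-ec2 does; the helper is illustrative, not the code from the merged pull request.
{code}
# Sketch: ask EC2 only for the instances in the named cluster's security
# groups, rather than fetching every instance and filtering locally.
import boto.ec2

def get_cluster_instances(region, cluster_name):
    conn = boto.ec2.connect_to_region(region)
    reservations = conn.get_all_instances(filters={
        "instance.group-name": [cluster_name + "-master",
                                cluster_name + "-slaves"],
        "instance-state-name": ["pending", "running"],
    })
    return [inst for r in reservations for inst in r.instances]

# Only matching instances come back, so accounts with hundreds of unrelated
# instances no longer require a full client-side scan.
instances = get_cluster_instances("us-east-1", "my-spark-cluster")
{code}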
[jira] [Updated] (SPARK-6215) Shorten apply and update funcs in GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6215: - Component/s: SQL (Assign component to new JIRAs please) Shorten apply and update funcs in GenerateProjection Key: SPARK-6215 URL: https://issues.apache.org/jira/browse/SPARK-6215 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Some codes in GenerateProjection look redundant and can be shortened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4496) smallint (16 bit value) is being send as a 32 bit value in the thrift interface.
[ https://issues.apache.org/jira/browse/SPARK-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352046#comment-14352046 ] Sean Owen commented on SPARK-4496: -- Can you add any detail to this? where, and what is the manifestation of the problem? is it a bug or just suboptimal? smallint (16 bit value) is being send as a 32 bit value in the thrift interface. --- Key: SPARK-4496 URL: https://issues.apache.org/jira/browse/SPARK-4496 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.1.0 Reporter: Chip Sands -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3742) Link to Spark UI sometimes fails when using H/A RM's
[ https://issues.apache.org/jira/browse/SPARK-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3742. -- Resolution: Duplicate Link to Spark UI sometimes fails when using H/A RM's Key: SPARK-3742 URL: https://issues.apache.org/jira/browse/SPARK-3742 Project: Spark Issue Type: Bug Components: YARN Reporter: meiyoula When running an application on yarn, the hyperlink on yarn page can't jump to sparkUI page. It happens sometimes. The error message is: This is standby RM. Redirecting to the current active RM: http://vm-181:8088/proxy/application_1409206382122_0037 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6183) Skip bad workers when re-launching executors
[ https://issues.apache.org/jira/browse/SPARK-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352048#comment-14352048 ] Sean Owen commented on SPARK-6183: -- Isn't this a duplicate of https://issues.apache.org/jira/browse/SPARK-4609 ? Skip bad workers when re-launching executors Key: SPARK-6183 URL: https://issues.apache.org/jira/browse/SPARK-6183 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Peng Zhen In a standalone cluster, when an executor launch fails, the master should avoid re-launching it on the same worker. According to the current scheduling logic, the failed executor will very likely be re-launched on the same worker, eventually causing the application to be removed from the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-896) ADD_JARS does not add all classes to classpath in the spark-shell for cluster on Mesos.
[ https://issues.apache.org/jira/browse/SPARK-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-896. - Resolution: Won't Fix I'm gonna call this WontFix as ADD_JARS has been deprecated for a while. ADD_JARS does not add all classes to classpath in the spark-shell for cluster on Mesos. --- Key: SPARK-896 URL: https://issues.apache.org/jira/browse/SPARK-896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Gary Malouf I do not believe the issue is limited to scheduler/executors running on Mesos but added the information for debugging purposes. h3. Reproducing the issue: # Implement some custom functionalities and package them into a 'monster jar' with something like sbt assembly. # Drop this jar onto the Spark master box and specify the path to it in the ADD_JARS variable. # Start up the spark shell on same box as the master. You should be able to import packages/classes specified in the jar without any compilation trouble. # In a map function on an RDD, trying to call a class from within this jar (with fully qualified name) fails on a ClassNotFoundException. h3. Workaround Matei Zaharia suggested adding this jar to the SPARK_CLASSPATH environment variable - that resolved the issue. My understanding however is that the functionality should work using solely the ADD_JARS variable - the documentation does not capture this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6205) UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6205. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4933 [https://github.com/apache/spark/pull/4933] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError --- Key: SPARK-6205 URL: https://issues.apache.org/jira/browse/SPARK-6205 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 {code} mvn -DskipTests -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 clean install mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 test -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite -Dtest=none -pl core/ {code} will produce: {code} UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... {code} It doesn't seem to happen without the various profiles set above. The fix is simple, although sounds weird; Selenium's dependency on {{xml-apis:xml-apis}} must be manually included in core's test dependencies. This probably has something to do with Hadoop 2 vs 1 dependency changes and the fact that Maven test deps aren't transitive, AFAIK. PR coming... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352374#comment-14352374 ] Shivaram Venkataraman commented on SPARK-5134: -- Yeah if you exclude Spark's Hadoop dependency things work correctly for Hadoop1. There are some additional issues that come up in 1.2 if due to the Guava changes, but those are not related to the default Hadoop version change. I think the documentation to update would be [1] but I am thinking it would be good to mention this in the Quick Start guide [2] as well [1] https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/hadoop-third-party-distributions.md#linking-applications-to-the-hadoop-version [2] https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/quick-start.md#self-contained-applications Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6219) Expand Python lint checks to check for compilation errors
Nicholas Chammas created SPARK-6219: --- Summary: Expand Python lint checks to check for compilation errors Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6211) Test Python Kafka API using Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352425#comment-14352425 ] Saisai Shao commented on SPARK-6211: Thanks [~tdas] for your suggestion. Let me first understand how the Python unit tests work, and then figure out how to add a unit test in Python. Test Python Kafka API using Python unit tests - Key: SPARK-6211 URL: https://issues.apache.org/jira/browse/SPARK-6211 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical This is tricky in python because the KafkaStreamSuiteBase (which has the functionality of creating embedded kafka clusters) is in the test package, which is not in the python path. To fix that, we have two ways. 1. Add test jar to classpath in python test. That's kind of trickier. 2. Bring that into the src package (maybe renamed as KafkaTestUtils), and then wrap that in python to use it from python. If (2) does not add any extra test dependencies to the main Kafka pom, then 2 should be simpler to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
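A very rough sketch of what option (2) from the description could look like from the Python side once a {{KafkaTestUtils}} helper lives in the main source tree. The class name and its methods ({{setup}}, {{createTopic}}, {{sendMessages}}, {{teardown}}) are assumptions for illustration only, not an existing API.
{code}
# Hypothetical sketch: drive a JVM-side KafkaTestUtils helper from a Python
# unit test through the Py4J gateway that PySpark already exposes.
# All KafkaTestUtils method names below are assumed, not the current API.
import unittest
from pyspark import SparkContext

class KafkaStreamTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local[2]", "kafka-python-test")
        jvm = self.sc._jvm  # Py4J view of the JVM
        self.kafka_utils = jvm.org.apache.spark.streaming.kafka.KafkaTestUtils()
        self.kafka_utils.setup()  # start an embedded ZooKeeper + Kafka broker

    def tearDown(self):
        self.kafka_utils.teardown()
        self.sc.stop()

    def test_round_trip(self):
        self.kafka_utils.createTopic("topic1")
        self.kafka_utils.sendMessages("topic1", ["a", "b", "c"])
        # ... build a Python Kafka DStream against the embedded broker and
        # assert that the messages come back ...
{code}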
[jira] [Resolved] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.
[ https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3287. -- Resolution: Not a Problem Last update on the PR a while ago says that this was likely already fixed. When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed. Key: SPARK-3287 URL: https://issues.apache.org/jira/browse/SPARK-3287 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3287.patch When ResourceManager High Availability is enabled, there will be multiple resource managers and each of them could act as a proxy. AmIpFilter is modified to accept multiple proxy hosts. But Spark ApplicationMaster fails to read the ResourceManager IPs properly from the configuration. So AmIpFilter is initialized with an empty set of proxy hosts. So any access to the ApplicationMaster WebUI will be redirected to port RM port on the local host. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352102#comment-14352102 ] Shixiong Zhu commented on SPARK-5124: - The problem is that the message may come from caller.receive, but the callee wants to send the reply to caller.receiveAndReply. However, I cannot find a use case now. But I find some RpcEndpoint may need to know the sender's address. So I added the sender method to RpcCallContext. And I also removed replyWithSender since it can be replaced with RpcCallContext.sender.sendWithReply(msg, self) now. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3457) ConcurrentModificationException starting up pyspark
[ https://issues.apache.org/jira/browse/SPARK-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3457. -- Resolution: Duplicate Given that this also concerns accessing the system {{Properties}} object, it's the same as SPARK-4952 I'm sure. ConcurrentModificationException starting up pyspark --- Key: SPARK-3457 URL: https://issues.apache.org/jira/browse/SPARK-3457 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Hadoop 2.3 (CDH 5.1) on Ubuntu precise Reporter: Shay Rojansky Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in yarn-client mode (no additional params or anything), I got the exception below. Rerunning pyspark 5 times afterwards did not reproduce the issue. {code} 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1410275267606 yarnAppState: RUNNING 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master. grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011, /proxy/application_1410268447887_0011 Traceback (most recent call last): File /opt/spark/python/pyspark/shell.py, line 44, in module 14/09/09 18:07:58 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter sc = SparkContext(appName=PySparkShell, pyFiles=add_files) File /opt/spark/python/pyspark/context.py, line 107, in __init__ conf) File /opt/spark/python/pyspark/context.py, line 155, in _do_init self._jsc = self._initialize_context(self._conf._jconf) File /opt/spark/python/pyspark/context.py, line 201, in _initialize_context return self._jvm.JavaSparkContext(jconf) File /opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 701, in __call__ File /opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. 
: java.util.ConcurrentModificationException at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) at scala.collection.Iterator$class.toStream(Iterator.scala:1143) at scala.collection.AbstractIterator.toStream(Iterator.scala:1157) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream.length(Stream.scala:284) at scala.collection.SeqLike$class.sorted(SeqLike.scala:608) at scala.collection.AbstractSeq.sorted(Seq.scala:40) at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324) at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297) at org.apache.spark.SparkContext.init(SparkContext.scala:334) at org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:214) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA
[jira] [Resolved] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2541. -- Resolution: Duplicate Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In spark 0.9.x you could access secure HDFS from Standalone deploy, that doesn't work in 1.X anymore. It looks like the issues is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if the currentUser == user. Not sure how it affects when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.
[ https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2572: - Component/s: (was: Spark Core) Mesos Priority: Minor (was: Major) I wonder if this is still an issue, since we've since had a number of improvements to cleaning up the executors' work dirs, which might affect Mesos. Can't delete local dir on executor automatically when running spark over Mesos. --- Key: SPARK-2572 URL: https://issues.apache.org/jira/browse/SPARK-2572 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Yadong Qi Priority: Minor When running Spark over Mesos in “fine-grained” or “coarse-grained” mode, the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not deleted automatically after the application finishes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided
[ https://issues.apache.org/jira/browse/SPARK-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1985: - Component/s: (was: Spark Core) Mesos Labels: (was: mesos) The code in question at that point in time was:
{code}
val sparkHome = sc.getSparkHome().getOrElse(throw new SparkException(
  "Spark home is not set; set it through the spark.home system " +
  "property, the SPARK_HOME environment variable or the SparkContext constructor"))
{code}
and it's now
{code}
val executorSparkHome = sc.conf.getOption("spark.mesos.executor.home")
  .orElse(sc.getSparkHome()) // Fall back to driver Spark home for backward compatibility
  .getOrElse {
    throw new SparkException("Executor Spark home `spark.mesos.executor.home` is not set!")
  }
{code}
So {{SPARK_HOME}} / {{spark.home}} are no longer required, although they've just been replaced with another more specific value in SPARK-3264 / https://github.com/apache/spark/commit/41dc5987d9abeca6fc0f5935c780d48f517cdf95 Although the assembly is automatically added to the classpath by {{compute-classpath.sh}} too, that may not be 100% of what this is asking, which is to be able to not set a 'home' at all. My read of SPARK-3264 however is that we should have an explicit 'home' setting for Mesos executors. Or else I'm not clear how you find `bin/spark-class` for example (see the relevant change in https://github.com/apache/spark/commit/4a4f9ccba2b42b64356db7f94ed9019212fc7317 too) SPARK_HOME shouldn't be required when spark.executor.uri is provided Key: SPARK-1985 URL: https://issues.apache.org/jira/browse/SPARK-1985 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Environment: MESOS Reporter: Gerard Maas When trying to run that simple example on a Mesos installation, I get an error that SPARK_HOME is not set. A local spark installation should not be required to run a job on Mesos. All that's needed is the executor package, being the assembly.tar.gz on a reachable location (HDFS/S3/HTTP). I went looking into the code and indeed there's a check on SPARK_HOME [2] regardless of the presence of the assembly but it's actually only used if the assembly is not provided (which is a kind-of best-effort recovery strategy). Current flow: if (!SPARK_HOME) fail("No SPARK_HOME") else if (assembly) { use assembly } else { try use SPARK_HOME to build spark_executor } Should be: sparkExecutor = if (assembly) { assembly } else if (SPARK_HOME) { try use SPARK_HOME to build spark_executor } else { fail("No executor found. Please provide spark.executor.uri (preferred) or spark.home") } [1] http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
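As a rough illustration of the fallback order the description asks for (prefer the prepackaged executor URI, fall back to a configured Spark home, and only fail when neither is set), the selection logic could look something like the sketch below. This is not the MesosSchedulerBackend implementation; the method name and error message are invented for illustration.

{code}
import org.apache.spark.{SparkConf, SparkException}

object ExecutorLocationSketch {
  // Returns Left(uri) when a prepackaged executor (spark.executor.uri) should
  // be fetched from HDFS/S3/HTTP, or Right(home) when the executor command
  // must be built from a local Spark installation.
  def executorLocation(conf: SparkConf): Either[String, String] = {
    conf.getOption("spark.executor.uri") match {
      case Some(uri) => Left(uri)
      case None =>
        val home = conf.getOption("spark.mesos.executor.home")
          .orElse(conf.getOption("spark.home"))
          .getOrElse(throw new SparkException(
            "No executor found. Please provide spark.executor.uri (preferred) " +
              "or spark.mesos.executor.home / spark.home"))
        Right(home)
    }
  }
}
{code}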
[jira] [Resolved] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided
[ https://issues.apache.org/jira/browse/SPARK-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1985. -- Resolution: Not a Problem SPARK_HOME shouldn't be required when spark.executor.uri is provided Key: SPARK-1985 URL: https://issues.apache.org/jira/browse/SPARK-1985 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Environment: MESOS Reporter: Gerard Maas When trying to run that simple example on a Mesos installation, I get an error that SPARK_HOME is not set. A local spark installation should not be required to run a job on Mesos. All that's needed is the executor package, being the assembly.tar.gz on a reachable location (HDFS/S3/HTTP). I went looking into the code and indeed there's a check on SPARK_HOME [2] regardless of the presence of the assembly but it's actually only used if the assembly is not provided (which is a kind-of best-effort recovery strategy). Current flow: if (!SPARK_HOME) fail("No SPARK_HOME") else if (assembly) { use assembly } else { try use SPARK_HOME to build spark_executor } Should be: sparkExecutor = if (assembly) { assembly } else if (SPARK_HOME) { try use SPARK_HOME to build spark_executor } else { fail("No executor found. Please provide spark.executor.uri (preferred) or spark.home") } [1] http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352068#comment-14352068 ] Sean Owen commented on SPARK-3685: -- This is 90% the same discussion as SPARK-1529, although this concerns making the current behavior more explicit (e.g. fail on an hdfs: URI) whereas SPARK-1529 (and the discussion below) discusses making other FS schemes work. I'd like to potentially address this issue without prejudicing SPARK-1529. In fact this discussion usefully contains a good use case for putting a local dir on distributed storage, whereas I personally don't see it in the arguments in SPARK-1529. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Utils#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
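To make the failure mode concrete: java.io.File has no notion of URI schemes, so a value like hdfs:/tmp/foo is treated as a relative path whose first component is literally named "hdfs:". The sketch below shows that behaviour, plus one possible up-front scheme check; the check is an assumption for illustration, not Spark's actual validation code.

{code}
import java.io.File
import java.net.URI

object LocalDirCheckSketch {
  def main(args: Array[String]): Unit = {
    // java.io.File just sees an ordinary relative path here, so the absolute
    // path becomes <cwd>/hdfs:/tmp/foo, i.e. a directory literally named "hdfs:".
    val f = new File("hdfs:/tmp/foo")
    println(f.getAbsolutePath)

    println(isLocalPath("hdfs:/tmp/foo")) // false -> could fail fast instead of silently
    println(isLocalPath("/tmp/foo"))      // true
    println(isLocalPath("file:/tmp/foo")) // true
  }

  // One possible up-front check: accept only paths with no scheme or an
  // explicit file: scheme.
  def isLocalPath(path: String): Boolean = {
    val scheme = new URI(path).getScheme
    scheme == null || scheme == "file"
  }
}
{code}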
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352101#comment-14352101 ] Shixiong Zhu commented on SPARK-5124: - The problem is that the message may come from caller.receive, but the callee wants to send the reply to caller.receiveAndReply. However, I cannot find a use case for that now. But I found that some RpcEndpoints may need to know the sender's address, so I added a sender method to RpcCallContext. I also removed replyWithSender, since it can now be replaced with RpcCallContext.sender.sendWithReply(msg, self). Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we could standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
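For readers following along, the shape of the interface being discussed is roughly as follows. The names RpcCallContext, sender, and sendWithReply are taken from the comment above; everything else is assumed for illustration, and this is not the API that was eventually merged.

{code}
import scala.concurrent.Future

// Reference to a remote (or local) endpoint.
trait RpcEndpointRef {
  def send(message: Any): Unit
  def sendWithReply[T](message: Any, sender: RpcEndpointRef): Future[T]
}

// Context handed to an endpoint while it processes a request.
trait RpcCallContext {
  // Reply to the message currently being processed.
  def reply(response: Any): Unit
  // Whoever sent the current message, so the endpoint can initiate a new
  // request back to it (making a separate replyWithSender helper unnecessary).
  def sender: RpcEndpointRef
}

trait RpcEndpoint {
  def self: RpcEndpointRef
  // Fire-and-forget messages.
  def receive: PartialFunction[Any, Unit]
  // Messages that expect an answer; the context carries reply() and sender.
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
}
{code}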
[jira] [Issue Comment Deleted] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-5124: Comment: was deleted (was: The problem is that the message may come from caller.receive, but the callee wants to send the reply to caller.receiveAndReply. However, I cannot find a use case now. But I find some RpcEndpoint may need to know the sender's address. So I added the sender method to RpcCallContext. And I also removed replyWithSender since it can be replaced with RpcCallContext.sender.sendWithReply(msg, self) now.) Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1444) Update branch-0.9's SBT to 0.13.1 so that it works with Java 8
[ https://issues.apache.org/jira/browse/SPARK-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1444. -- Resolution: Won't Fix Target Version/s: (was: 0.9.3) I suggest we call this WontFix, as 0.9 is now 4 minor releases behind, SBT isn't the primary or only build, and the straightforward way to address this does not seem to work. Update branch-0.9's SBT to 0.13.1 so that it works with Java 8 -- Key: SPARK-1444 URL: https://issues.apache.org/jira/browse/SPARK-1444 Project: Spark Issue Type: Bug Components: Build Reporter: Matei Zaharia Apparently the older versions have problems if you compile on Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2326) DiskBlockManager could add DiskChecker function for kicking off bad directories
[ https://issues.apache.org/jira/browse/SPARK-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2326. -- Resolution: Duplicate Essentially the same idea, that {{DiskStore}} / {{BlockManager}} could blacklist bad directories. DiskBlockManager could add DiskChecker function for kicking off bad directories --- Key: SPARK-2326 URL: https://issues.apache.org/jira/browse/SPARK-2326 Project: Spark Issue Type: Bug Components: Block Manager Reporter: YanTang Zhai If a disk failure happens while the Spark cluster is running, DiskBlockManager should kick off bad directories automatically. DiskBlockManager could add a DiskChecker function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
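A minimal sketch of the kind of check being proposed, in the spirit of Hadoop's DiskChecker: probe each configured directory and keep only the ones in which files can still be created and deleted. This is illustrative only, not Spark's DiskBlockManager code.

{code}
import java.io.File
import java.util.UUID

object DiskCheckSketch {
  // Filter a set of configured local directories down to the healthy ones.
  def healthyDirs(dirs: Seq[File]): Seq[File] = dirs.filter(isHealthy)

  // A directory is considered healthy if it exists (or can be created) and a
  // probe file can be created and deleted inside it.
  def isHealthy(dir: File): Boolean = {
    try {
      if (!dir.isDirectory && !dir.mkdirs()) return false
      val probe = new File(dir, s".diskcheck-${UUID.randomUUID()}")
      val created = probe.createNewFile()
      val deleted = probe.delete()
      created && deleted
    } catch {
      case _: java.io.IOException => false
      case _: SecurityException => false
    }
  }
}
{code}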
[jira] [Updated] (SPARK-4450) SparkSQL producing incorrect answer when using --master yarn
[ https://issues.apache.org/jira/browse/SPARK-4450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4450: - Component/s: (was: Spark Core) SQL SparkSQL producing incorrect answer when using --master yarn Key: SPARK-4450 URL: https://issues.apache.org/jira/browse/SPARK-4450 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: CDH 5.1 Reporter: Rick Bischoff A simple summary program using spark-submit --master local MyJob.py vs. spark-submit --master yarn MyJob.py produces different answers: the output produced by local has been independently verified and is correct, but the output from yarn is incorrect. It does not appear to happen with smaller files, only large files. MyJob.py is
{code}
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

def maybeFloat(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0.
    else:
        return float(x)

def maybeInt(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0
    else:
        return int(x)

def mapColl(p):
    return {"f1": p[0], "f2": p[1], "f3": p[2], "f4": int(p[3]), "f5": int(p[4]),
            "f6": p[5], "f7": p[6], "f8": p[7], "f9": p[8], "f10": maybeInt(p[9]),
            "f11": p[10], "f12": p[11], "f13": p[12], "f14": p[13],
            "f15": maybeFloat(p[14]), "f16": maybeInt(p[15]), "f17": maybeFloat(p[16])}

sc = SparkContext()
sqlContext = SQLContext(sc)

lines = sc.textFile("sample.csv")
fields = lines.map(lambda l: mapColl(l.split(",")))
collTable = sqlContext.inferSchema(fields)
collTable.registerAsTable("sample")

test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
                      + "FROM sample " \
                      + "GROUP BY f9")
foo = test.collect()
print foo

sc.stop()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4876) An exception thrown when accessing a Spark SQL table using a JDBC driver from a standalone app.
[ https://issues.apache.org/jira/browse/SPARK-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4876. -- Resolution: Not a Problem Agree that the problem is that the metastore config is not pointing to HDFS as it should. It's looking at a local path, not an hdfs: path. An exception thrown when accessing a Spark SQL table using a JDBC driver from a standalone app. --- Key: SPARK-4876 URL: https://issues.apache.org/jira/browse/SPARK-4876 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.1 Environment: Mac OS X 10.10.1, Apache Spark 1.1.1, Reporter: Leonid Mikhailov I am running Spark version 1.1.1 (built on Mac using: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package). I start the JDBC server like this: ./sbin/start-thriftserver.sh In my IDE I am running the following example:
{code:title=TestSparkSQLJdbcAccess.java|borderStyle=solid}
package com.bla.spark.sql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TestSparkSQLJdbcAccess {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
      System.exit(1);
    }
    // replace hive here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.execute("drop table if exists " + tableName);
    stmt.execute("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    stmt.execute(sql);
    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}
{code}
The pom.xml is as follows:
{code:xml}
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.esri.spark</groupId>
  <artifactId>HiveJDBCTest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>HiveJDBCTest</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>0.12.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>
  </dependencies>
</project>
{code}
I am getting an exception:
{noformat}
Exception in thread "main" java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/testhivedrivertable is not a directory or unable to create one)
 at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:165)
 at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:153)
 at
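Given the resolution above, the fix on the user's side would be to make the metastore warehouse location point at HDFS rather than the local default that appears in the error. One way to do that, assuming an HDFS namenode at a placeholder host and port, is a hive-site.xml on Spark's conf path, for example:

{code:xml}
<!-- conf/hive-site.xml: point the Hive warehouse at HDFS instead of the
     default local file:/user/hive/warehouse; host and port are placeholders. -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://namenode-host:8020/user/hive/warehouse</value>
  </property>
</configuration>
{code}

With that in place, tables such as testHiveDriverTable would be created under the HDFS warehouse directory instead of a non-existent local path.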
[jira] [Updated] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6208: Priority: Minor (was: Major) executor-memory does not work when using local cluster -- Key: SPARK-6208 URL: https://issues.apache.org/jira/browse/SPARK-6208 Project: Spark Issue Type: New Feature Components: Spark Submit Reporter: Yin Huai Priority: Minor It seems the executor memory set with a local cluster is not correctly applied (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). Also, totalExecutorCores seems to have the same issue (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
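For illustration, with a local-cluster master the per-worker memory is encoded in the master string itself (number of workers, cores per worker, memory per worker in MB), and spark.executor.memory can be set directly on the SparkConf. Whether that sidesteps the SparkSubmit code path referenced above is an assumption here, not something verified in this issue.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object LocalClusterMemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // 2 workers, 1 core each, 1024 MB per worker, all encoded in the master string.
      .setMaster("local-cluster[2,1,1024]")
      .setAppName("local-cluster-memory-sketch")
      // Set executor memory on the conf rather than relying on --executor-memory.
      .set("spark.executor.memory", "512m")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}
{code}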