[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352268#comment-14352268 ] Patrick Wendell commented on SPARK-5134: Hey [~rdub] [~srowen], As part of the 1.3 release cycle I did some more forensics on the actual artifacts we publish. It turns out that because of the changes made for Scala 2.11 with the way our publishing works, we've actually been publishing poms that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom Hadoop version is decoupled now from the default one in the build itself, because of our use of the effective pom plugin. https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119 I'm actually a bit bummed that we (unintentionally) made this change in 1.2 because I do fear it likely screwed things up for some users. But on the plus side, since we no decouple the publishing from the default version in the pom, I don't see a big issue with updating the POM. So I withdraw my objection on the PR. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352153#comment-14352153 ] Yin Huai commented on SPARK-6208: - [~pwendell] Oh, I see. I was trying to increase the executor memory so I can cache a larger RDD. Since --conf spark.executor.memory works, should we resolve it as not a problem? executor-memory does not work when using local cluster -- Key: SPARK-6208 URL: https://issues.apache.org/jira/browse/SPARK-6208 Project: Spark Issue Type: New Feature Components: Spark Submit Reporter: Yin Huai Priority: Minor It seems the executor memory set with a local cluster is not correctly applied (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). Also, totalExecutorCores seems to have the same issue (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
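For reference, a minimal PySpark sketch of the --conf style workaround mentioned in the comment above: setting the spark.executor.memory property directly instead of relying on the --executor-memory flag. The local-cluster master string and the sizes here are illustrative assumptions, not values taken from this report, and whether the property actually reaches the executors is exactly the behavior this issue discusses.
{code}
from pyspark import SparkConf, SparkContext

# Hypothetical sizing: 2 workers, 1 core each, 2048 MB per worker process.
conf = (SparkConf()
        .setMaster("local-cluster[2,1,2048]")
        .setAppName("executor-memory-workaround")
        .set("spark.executor.memory", "2g"))  # property route, equivalent to passing --conf to spark-submit

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).count())    # run any small job; check the executor memory in the web UI
sc.stop()
{code}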
[jira] [Created] (SPARK-6216) Check Python version in worker before run PySpark job
Davies Liu created SPARK-6216: - Summary: Check Python version in worker before run PySpark job Key: SPARK-6216 URL: https://issues.apache.org/jira/browse/SPARK-6216 Project: Spark Issue Type: Improvement Reporter: Davies Liu PySpark can only run when the driver and the workers use the same major Python version (both 2.6 or both 2.7); it will cause random errors if the driver has 2.7 and the workers have 2.6 (or vice versa). For example: {code} davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) SparkContext available as sc, SQLContext available as sqlCtx. sc.textFile('LICENSE').map(lambda l: l.split()).count() org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/davies/work/spark/python/pyspark/worker.py, line 101, in main process() File /Users/davies/work/spark/python/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 281, in func return f(iterator) File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File stdin, line 1, in lambda TypeError: 'bool' object is not callable at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:177) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
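A minimal sketch of the kind of guard this issue proposes, assuming (hypothetically) that the driver ships its own major.minor version string to the worker before any task code runs; the function and variable names are illustrative, not the actual names used in pyspark/worker.py.
{code}
import sys

def check_python_version(driver_version):
    """Fail fast with a clear error instead of a confusing mid-task TypeError.

    driver_version is assumed to be a string such as "2.7" sent over by the driver.
    """
    worker_version = "%d.%d" % sys.version_info[:2]
    if worker_version != driver_version:
        raise RuntimeError(
            "Python in worker has different version %s than that in driver %s; "
            "PySpark cannot run with different Python versions in driver and workers"
            % (worker_version, driver_version))

# Hypothetical usage at worker start-up, before deserializing the task:
# check_python_version(version_string_read_from_driver)
{code}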
[jira] [Assigned] (SPARK-6216) Check Python version in worker before run PySpark job
[ https://issues.apache.org/jira/browse/SPARK-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6216: - Assignee: Davies Liu Check Python version in worker before run PySpark job - Key: SPARK-6216 URL: https://issues.apache.org/jira/browse/SPARK-6216 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu PySpark can only run with the same major version both in driver and worker ( both of the are 2.6 or 2.7), it will cause random error if it have 2.7 in driver or 2.6 in worker (or vice). For example: {code} davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) SparkContext available as sc, SQLContext available as sqlCtx. sc.textFile('LICENSE').map(lambda l: l.split()).count() org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/davies/work/spark/python/pyspark/worker.py, line 101, in main process() File /Users/davies/work/spark/python/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 281, in func return f(iterator) File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File stdin, line 1, in lambda TypeError: 'bool' object is not callable at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:177) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6215) Shorten apply and update funcs in GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351987#comment-14351987 ] Apache Spark commented on SPARK-6215: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4940 Shorten apply and update funcs in GenerateProjection Key: SPARK-6215 URL: https://issues.apache.org/jira/browse/SPARK-6215 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Some code in GenerateProjection looks redundant and can be shortened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6215) Shorten apply and update funcs in GenerateProjection
Liang-Chi Hsieh created SPARK-6215: -- Summary: Shorten apply and update funcs in GenerateProjection Key: SPARK-6215 URL: https://issues.apache.org/jira/browse/SPARK-6215 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Some code in GenerateProjection looks redundant and can be shortened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352306#comment-14352306 ] Kostas Sakellis commented on SPARK-1239: How many reduce-side tasks do you have? Can you please attach your logs that show the OOM errors? Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352317#comment-14352317 ] Shivaram Venkataraman commented on SPARK-5134: -- Yeah, so this did change in 1.2, and I think I mentioned it to Patrick when it affected a couple of other projects of mine. The main problem there was that even if you have an explicit Hadoop 1 dependency in your project, SBT picks up the highest version required while building an assembly jar for the project. Thus, with Spark linked against Hadoop 2.2, one would need an exclusion rule to use Hadoop 1. It might be good to add this to the docs or to some of the example Quick Start documentation we have. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352356#comment-14352356 ] Patrick Wendell commented on SPARK-1239: It would be helpful if any users who have observed this could comment on the JIRA and give workload information. This has been more on the back burner since we've heard few reports of it on the mailing list, etc... Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352396#comment-14352396 ] liyunzhang_intel commented on SPARK-5682: - Hi [~srowen]: Encrypted shuffle makes the shuffle process safer. I think it is necessary in Spark. The previous design reused the Hadoop encrypted shuffle algorithm to enable Spark encrypted shuffle. That design has a big problem: it imports many crypto classes, like CryptoInputStream and CryptoOutputStream, which are marked private in Hadoop. Now my teammates and I have decided to write the crypto classes in Spark, so there is no dependence on Hadoop 2.6. We are not directly copying Hadoop code into Spark; we only reuse the crypto algorithms (JCE/AES-NI) that Hadoop uses. Maybe I need to rename the JIRA from Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle to Add encrypted shuffle in spark. Any advice is welcome. Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6 and makes the shuffled data safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6217) insertInto doesn't work in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Cloud updated SPARK-6217: - Summary: insertInto doesn't work in PySpark (was: insertInto doesn't work) insertInto doesn't work in PySpark -- Key: SPARK-6217 URL: https://issues.apache.org/jira/browse/SPARK-6217 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Mac OS X Yosemite 10.10.2 Python 2.7.9 Spark 1.3.0 Reporter: Charles Cloud The following code, running in an IPython shell throws an error: {code:none} In [1]: from pyspark import SparkContext, HiveContext In [2]: sc = SparkContext('local[*]', 'test') Spark assembly has been built with Hive, including Datanucleus jars on classpath In [3]: sql = HiveContext(sc) In [4]: import pandas as pd In [5]: df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1, 2, 3], 'c': list('abc')}) In [6]: df2 = pd.DataFrame({'a': [2.0, 3.0, 4.0], 'b': [4, 5, 6], 'c': list('def')}) In [7]: sdf = sql.createDataFrame(df) In [8]: sdf2 = sql.createDataFrame(df2) In [9]: sql.registerDataFrameAsTable(sdf, 'sdf') In [10]: sql.registerDataFrameAsTable(sdf2, 'sdf2') In [11]: sql.cacheTable('sdf') In [12]: sql.cacheTable('sdf2') In [13]: sdf2.insertInto('sdf') # throws an error {code} Here's the Java traceback: {code:none} Py4JJavaError: An error occurred while calling o270.insertInto. : java.lang.AssertionError: assertion failed: No plan for InsertIntoTable (LogicalRDD [a#0,b#1L,c#2], MapPartitionsRDD[13] at mapPartitions at SQLContext.scala:1167), Map(), false InMemoryRelation [a#6,b#7L,c#8], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [a#6,b#7L,c#8], MapPartitionsRDD[41] at mapPartitions at SQLContext.scala:1167), Some(sdf2) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1085) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1083) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.insertInto(DataFrame.scala:1134) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} I'd be ecstatic if this was my own fault, and I'm somehow using it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352268#comment-14352268 ] Patrick Wendell edited comment on SPARK-5134 at 3/8/15 11:27 PM: - Hey [~rdub] [~srowen], As part of the 1.3 release cycle I did some more forensics on the actual artifacts we publish. It turns out that because of the changes made for Scala 2.11 with the way our publishing works, we've actually been publishing poms that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom Hadoop version is decoupled now from the default one in the build itself, because of our use of the effective pom plugin. https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119 I'm actually a bit bummed that we (unintentionally) made this change in 1.2 because I do fear it likely screwed things up for some users. But on the plus side, since we now decouple the publishing from the default version in the pom, I don't see a big issue with updating the POM. So I withdraw my objection on the PR. was (Author: pwendell): Hey [~rdub] [~srowen], As part of the 1.3 release cycle I did some more forensics on the actual artifacts we publish. It turns out that because of the changes made for Scala 2.11 with the way our publishing works, we've actually been publishing poms that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom Hadoop version is decoupled now from the default one in the build itself, because of our use of the effective pom plugin. https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119 I'm actually a bit bummed that we (unintentionally) made this change in 1.2 because I do fear it likely screwed things up for some users. But on the plus side, since we no decouple the publishing from the default version in the pom, I don't see a big issue with updating the POM. So I withdraw my objection on the PR. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352341#comment-14352341 ] Patrick Wendell commented on SPARK-5134: [~shivaram] did it end up working alright if you just excluded Spark's Hadoop dependency? If so we can just document this. Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352377#comment-14352377 ] Mridul Muralidharan commented on SPARK-1239: Hitting the Akka frame size for the MapOutputTracker is very easy since we fetch the whole output (m * r). While I can't get into the specifics of our jobs or share logs, it is easy to see this hitting 1G for 100k mappers and 50k reducers. If this is not being looked into currently, I can add it to my list of things to fix - but if there is already work being done, I don't want to duplicate it. Even something trivial like what was done for task results would suffice (if we don't want the additional overhead of per-reduce map output generation at the master). Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
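To see why those mapper/reducer counts blow past a 1 GB frame, here is a back-of-envelope estimate (my own numbers, with a deliberately optimistic one byte per entry; the real per-entry cost of the serialized statuses depends on encoding and compression):
{code}
# Map output statuses form an m x r matrix of sizes, one entry per (mapper, reducer) pair.
mappers = 100000
reducers = 50000
bytes_per_entry = 1   # optimistic: a single compressed-size byte per pair

total_bytes = mappers * reducers * bytes_per_entry
print(total_bytes / float(1024 ** 3))   # ~4.7 GiB, already far above a 1 GB Akka frame
{code}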
[jira] [Commented] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352417#comment-14352417 ] Jianshi Huang commented on SPARK-6154: -- I see. Here's my build flag: -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Phadoop-2.4 -Djava.version=1.7 -DskipTests BTW, when will Kafka and JDBC be supported in 2.11 build? Jianshi Build error with Scala 2.11 for v1.3.0-rc2 -- Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352416#comment-14352416 ] Joseph K. Bradley commented on SPARK-3066: -- It's similar, I believe, for ALS. The cosine similarity metric you get with the dot product for ALS is a distance metric, right? So finding the top K products to recommend to a given user is essentially the same as finding the K product feature vectors which are closest to the user's feature vector. This optimization could be used both for recommending for a single user and for recommendAll. I'm not sure how effective these approximate nearest neighbor methods are. My understanding is that they work reasonably well as long as the feature space is fairly low-dimensional, which should often be the case for ALS. My hope is that these approximate nearest neighbor data structures can reduce communication. The ones I've seen are based on feature space partitioning, which could potentially allow you to figure out a subset of product partitions to check for each user. Using level 3 BLAS might be better; I'm really not sure. It won't reduce communication, though. These two types of optimizations might be orthogonal, anyway. Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
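To make the broadcast-plus-BLAS idea from the description concrete, here is a small PySpark/NumPy sketch under my own assumptions (user factors as an RDD of (id, vector) pairs, product factors small enough to broadcast); it illustrates the proposal, not MLlib's actual implementation, and all names are hypothetical.
{code}
import numpy as np

def recommend_all(user_factors_rdd, product_factors, product_ids, sc, k=10):
    """Top-k products per user: broadcast one side, score each block of users with a
    single matrix-matrix product (level-3 BLAS under the hood), then keep the k best."""
    P = sc.broadcast(np.asarray(product_factors))        # shape: (num_products, rank)
    ids = sc.broadcast(np.asarray(product_ids))

    def score_partition(rows):
        rows = list(rows)
        if not rows:
            return
        user_ids = [u for u, _ in rows]
        U = np.vstack([f for _, f in rows])               # shape: (block_size, rank)
        scores = U.dot(P.value.T)                         # one BLAS-3 call per block
        top = np.argsort(-scores, axis=1)[:, :k]          # indices of the k largest scores per user
        for i, uid in enumerate(user_ids):
            yield uid, list(zip(ids.value[top[i]], scores[i, top[i]]))

    return user_factors_rdd.mapPartitions(score_partition)
{code}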
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni commented on SPARK-6192: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 2:45 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 2:45 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I want to take this project as my starting point in spark. Any advice? Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 3:02 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed computation platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6206) spark-ec2 script reporting SSL error?
[ https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352481#comment-14352481 ] Nicholas Chammas commented on SPARK-6206: - OK, let us know what you find, [~Joe6521]. In general, please try to validate your issue on the user list or on Stack Overflow before reporting it here, unless you are really sure you've found a problem with Spark (as opposed to your environment). spark-ec2 script reporting SSL error? - Key: SPARK-6206 URL: https://issues.apache.org/jira/browse/SPARK-6206 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Joe O I have been using the spark-ec2 script for several months with no problems. Recently, when executing a script to launch a cluster I got the following error: {code} [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib {code} Nothing launches, the script exits. I am not sure if something on machine changed, this is a problem with EC2's certs, or a problem with Python. It occurs 100% of the time, and has been occurring over at least the last two days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
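One quick way to do the environment check suggested above is to see whether a plain HTTPS request to an EC2 endpoint fails the same way outside of spark-ec2 and boto. This is a diagnostic sketch of my own, not part of the spark-ec2 script, and the endpoint URL is just one example region.
{code}
# Python 2 diagnostic: if this also fails with an SSL/x509 error, the problem is the
# local Python/OpenSSL certificate setup rather than the spark-ec2 script itself.
import urllib2

try:
    urllib2.urlopen("https://ec2.us-east-1.amazonaws.com/", timeout=10)
    print("TLS handshake OK")
except urllib2.HTTPError:
    print("TLS handshake OK (the server returned an HTTP error, which is fine for this test)")
except Exception as e:
    print("TLS failure outside spark-ec2: %r" % (e,))
{code}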
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. 
For example: {code} spark-ec2 --ec2-instance-option {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352489#comment-14352489 ] Nicholas Chammas commented on SPARK-6220: - cc [~joshrosen] and [~shivaram] for feedback. The immediate motivation for this is the work I'm doing on automating spark-perf runs. As part of an automated spark-perf run, I'd like to: * set {{instance_initiated_shutdown_behavior=terminate}} for the non-spot instances launched by spark-ec2 (i.e. the master), so that the cluster can self-terminate without needing outside input * set {{instance_profile_arn}} for the master so that spark-perf results can be uploaded to S3 without having to handle AWS user credentials, via use of IAM profiles Since my use case is specialized, I didn't think it was worth adding top-level options for these EC2 features. So I generalized the idea to support any EC2 option supported by boto. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. 
Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
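As a rough illustration of how such a generic pass-through could work on the spark-ec2 side, here is a small Python sketch (my own assumed parsing, not an actual spark-ec2 patch): each --ec2-instance-option value is split into a key=value pair, literals such as True or numbers are evaluated safely, and the resulting dict is merged into the keyword arguments of the boto launch call. The opts.* names in the usage comment are hypothetical.
{code}
import ast

def parse_extended_options(option_strings):
    """Turn ['ebs_optimized=True', 'min_count=1'] into {'ebs_optimized': True, 'min_count': 1}."""
    kwargs = {}
    for opt in option_strings or []:
        key, _, value = opt.partition("=")
        try:
            kwargs[key] = ast.literal_eval(value)   # booleans, numbers, quoted strings
        except (ValueError, SyntaxError):
            kwargs[key] = value                     # fall back to the raw string (e.g. 'terminate')
    return kwargs

# Hypothetical wiring inside spark-ec2 when launching on-demand instances:
# extra_kwargs = parse_extended_options(opts.ec2_instance_options)
# reservation = image.run(instance_type=opts.instance_type, ..., **extra_kwargs)
{code}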
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352459#comment-14352459 ] Yan Ni edited comment on SPARK-6192 at 3/9/15 3:30 AM: --- hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed computation platforms like spark but don't have any experience. I would like to take this gsoc project as my starting point in spark. Any advice? Thanks! was (Author: leckie-chn): hello, I am a senior year undergraduate student and had experience in python ML. Now I am interested in distributed computation platforms like spark but don't have any experience. I would like to take this project as my starting point in spark. Any advice? Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6183) Skip bad workers when re-launching executors
[ https://issues.apache.org/jira/browse/SPARK-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352492#comment-14352492 ] Peng Zhen commented on SPARK-6183: -- [~srowen] they are related, but not the same. SPARK-4609 works on re-scheduling tasks, and SPARK-6183 works on re-launching executors. @davies Skip bad workers when re-launching executors Key: SPARK-6183 URL: https://issues.apache.org/jira/browse/SPARK-6183 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Peng Zhen In standalone cluster, when an executor launch fails, the master should avoid re-launching it on the same worker. According to the current scheduling logic, the failed executor will be highly possible re-launched on the same worker, and finally cause the application removed from the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6183) Skip bad workers when re-launching executors
[ https://issues.apache.org/jira/browse/SPARK-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352492#comment-14352492 ] Peng Zhen edited comment on SPARK-6183 at 3/9/15 3:36 AM: -- [~srowen] they are related, but not the same. SPARK-4609 works on re-scheduling tasks, and SPARK-6183 works on re-launching executors. [~davies] was (Author: zhpengg): [~srowen] they are related, but not the same. SPARK-4609 works on re-scheduling tasks, and SPARK-6183 works on re-launching executors. @davies Skip bad workers when re-launching executors Key: SPARK-6183 URL: https://issues.apache.org/jira/browse/SPARK-6183 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Peng Zhen In standalone cluster, when an executor launch fails, the master should avoid re-launching it on the same worker. According to the current scheduling logic, the failed executor will be highly possible re-launched on the same worker, and finally cause the application removed from the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. 
Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352514#comment-14352514 ] Shivaram Venkataraman commented on SPARK-6220: -- Seems like a good idea and the syntax sounds good to me. Just curious: Are these the only two boto calls we use ? Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
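To make the proposal above concrete, here is a minimal sketch of how such pass-through options could be collected and forwarded on the spark-ec2 side. The {{--ec2-instance-option}} flag and the {{parse_extra_options}} helper are illustrative assumptions, not code that exists in {{spark_ec2.py}} today; the parsed pairs are simply merged into the {{run_instances}} keyword arguments (and the same dict could feed {{request_spot_instances}}).
{code}
# Hypothetical sketch: collect repeated --ec2-instance-option KEY=VALUE flags
# and forward them as keyword arguments to boto's run_instances call.
import ast
from optparse import OptionParser

def parse_extra_options(pairs):
    """Turn ['ebs_optimized=True', ...] into {'ebs_optimized': True, ...}."""
    extra = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        try:
            # Interpret True/False/numbers literally; fall back to the raw string.
            extra[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            extra[key] = value
    return extra

parser = OptionParser()
parser.add_option("--ec2-instance-option", action="append", default=[],
                  dest="ec2_instance_options", metavar="KEY=VALUE")
(options, args) = parser.parse_args()

extra_kwargs = parse_extra_options(options.ec2_instance_options)
# conn would be the existing boto.ec2.connection.EC2Connection in spark_ec2.py,
# and the extra kwargs get merged into the run_instances call it already makes:
# conn.run_instances(image_id, instance_type=..., **extra_kwargs)
{code}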
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. 
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352284#comment-14352284 ] Mridul Muralidharan commented on SPARK-1239: [~pwendell] Is there any update on this ? This is fairly commonly hitting us, and we are at 1Gig for framesize already now ... Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6217) insertInto doesn't work
Charles Cloud created SPARK-6217: Summary: insertInto doesn't work Key: SPARK-6217 URL: https://issues.apache.org/jira/browse/SPARK-6217 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Mac OS X Yosemite 10.10.2 Python 2.7.9 Spark 1.3.0 Reporter: Charles Cloud The following code, running in an IPython shell throws an error: {code:none} In [1]: from pyspark import SparkContext, HiveContext In [2]: sc = SparkContext('local[*]', 'test') Spark assembly has been built with Hive, including Datanucleus jars on classpath In [3]: sql = HiveContext(sc) In [4]: import pandas as pd In [5]: df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1, 2, 3], 'c': list('abc')}) In [6]: df2 = pd.DataFrame({'a': [2.0, 3.0, 4.0], 'b': [4, 5, 6], 'c': list('def')}) In [7]: sdf = sql.createDataFrame(df) In [8]: sdf2 = sql.createDataFrame(df2) In [9]: sql.registerDataFrameAsTable(sdf, 'sdf') In [10]: sql.registerDataFrameAsTable(sdf2, 'sdf2') In [11]: sql.cacheTable('sdf') In [12]: sql.cacheTable('sdf2') In [13]: sdf2.insertInto('sdf') # throws an error {code} Here's the Java traceback: {code:none} Py4JJavaError: An error occurred while calling o270.insertInto. : java.lang.AssertionError: assertion failed: No plan for InsertIntoTable (LogicalRDD [a#0,b#1L,c#2], MapPartitionsRDD[13] at mapPartitions at SQLContext.scala:1167), Map(), false InMemoryRelation [a#6,b#7L,c#8], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [a#6,b#7L,c#8], MapPartitionsRDD[41] at mapPartitions at SQLContext.scala:1167), Some(sdf2) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1085) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1083) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.insertInto(DataFrame.scala:1134) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} I'd be ecstatic if this was my own fault, and I'm somehow using it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
Nicholas Chammas created SPARK-6218: --- Summary: Upgrade spark-ec2 from optparse to argparse Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352331#comment-14352331 ] Nicholas Chammas commented on SPARK-6218: - [~shivaram], [~joshrosen]: What do you think? Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6218: Description: spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. was: spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are mostly likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not include with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
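As a rough illustration of the validation benefit described above, the argparse sketch below ties a checker to each parameter via {{type=}}. The option names and the {{valid_spot_price}} function are made up for the example and are not taken from {{spark_ec2.py}}.
{code}
# Hypothetical sketch: with argparse, each parameter carries its own validator,
# replacing ad-hoc checks scattered through the script.
import argparse

def valid_spot_price(value):
    price = float(value)  # a ValueError here is reported cleanly by argparse
    if price <= 0:
        raise argparse.ArgumentTypeError("spot price must be positive: %r" % value)
    return price

parser = argparse.ArgumentParser(prog="spark-ec2")
parser.add_argument("--slaves", type=int, default=1,
                    help="number of slaves to launch")
parser.add_argument("--spot-price", type=valid_spot_price, metavar="PRICE",
                    help="launch slaves as spot instances at this max price")

args = parser.parse_args(["--slaves", "4", "--spot-price", "0.50"])
print("slaves=%d, spot price=%.2f" % (args.slaves, args.spot_price))
{code}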
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352398#comment-14352398 ] Apache Spark commented on SPARK-6219: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/4941 Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
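One possible shape for such a check is sketched below; this illustrates the idea and is not the contents of the linked pull request. It walks the Python sources and byte-compiles each file, so syntax errors are reported without importing or executing any module. The simpler {{python -m compileall -q python/}} would catch the same class of errors.
{code}
# Sketch of a compile-only lint pass over the Python sources.
import os
import py_compile
import sys

def compile_check(root):
    failed = False
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                py_compile.compile(path, doraise=True)
            except py_compile.PyCompileError as e:
                print(e.msg)
                failed = True
    return failed

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "python"
    sys.exit(1 if compile_check(root) else 0)
{code}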
[jira] [Commented] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352535#comment-14352535 ] Apache Spark commented on SPARK-6209: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4944 ExecutorClassLoader can leak connections after failing to load classes from the REPL class server - Key: SPARK-6209 URL: https://issues.apache.org/jira/browse/SPARK-6209 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.3, 1.3.0, 1.1.2, 1.2.1, 1.4.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. Here is a simple reproduction: With {code} ./bin/spark-shell --master local-cluster[8,8,512] {code} run the following command: {code} sc.parallelize(1 to 1000, 1000).map { x = try { Class.forName(some.class.that.does.not.Exist) } catch { case e: Exception = // do nothing } x }.count() {code} This job will run 253 tasks, then will completely freeze without any errors or failed tasks. It looks like the driver has 253 threads blocked in socketRead0() calls: {code} [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc 253 759 14674 {code} e.g. {code} qtp1287429402-13 daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable [0x0001159bd000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) {code} Jstack on the executors shows blocking in loadClass / findClass, where a single thread is RUNNABLE and waiting to hear back from the driver and other executor threads are BLOCKED on object monitor synchronization at Class.forName0(). Remotely triggering a GC on a hanging executor allows the job to progress and complete more tasks before hanging again. If I repeatedly trigger GC on all of the executors, then the job runs to completion: {code} jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run {code} The culprit is a {{catch}} block that ignores all exceptions and performs no cleanup: https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 This bug has been present since Spark 1.0.0, but I suspect that we haven't seen it before because it's pretty hard to reproduce. 
Triggering this error requires a job with tasks that trigger ClassNotFoundExceptions yet are still able to run to completion. It also requires that executors are able to leak enough open connections to exhaust the class server's Jetty thread pool limit, which requires that there are a large number of tasks (253+) and either a large number of executors or a very low amount of GC pressure on those executors (since GC will cause the leaked connections to be closed). The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352546#comment-14352546 ] Manoj Kumar commented on SPARK-6192: [~Manglano] [~leckie-chn] Hi, I am actually not a mentor but a student whom this GSoC project is preassigned to by Xiangrui (since I've been working on the Spark codebase for about a couple of months right now) . This project idea was actually a result of brainstorming across different Pull Requests. I would suggest you have a look at different issues which would help you gain familiarity with the API and help to propose a project proposal. Hope that helps. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support latest Scala (2.11.6+)
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Summary: Support latest Scala (2.11.6+) (was: Support Scala 2.11.6+) Support latest Scala (2.11.6+) -- Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Issue Type: New Feature (was: Improvement) Support Scala 2.11.5+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Priority: Major (was: Minor) Support Scala 2.11.5+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.6+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Summary: Support Scala 2.11.6+ (was: Support Scala 2.11.5+) Support Scala 2.11.6+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352572#comment-14352572 ] Yin Huai edited comment on SPARK-5463 at 3/9/15 5:37 AM: - Seems [~liancheng]'s fix for SPARK-5451 has been released with Parquet [RC5|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc5/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L244]. was (Author: yhuai): Seems [~liancheng]'s fix has been released with Parquet [RC5|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc5/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L244]. Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352524#comment-14352524 ] Nicholas Chammas commented on SPARK-6220: - As far as places where we create instances, yes, those are the 2 calls we use. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352546#comment-14352546 ] Manoj Kumar edited comment on SPARK-6192 at 3/9/15 4:51 AM: [~Manglano] [~leckie-chn] Hi, I am actually not a mentor but a student whom this GSoC project is preassigned to by Xiangrui (since I've been working on the Spark codebase for about a couple of months right now) . This project idea was actually a result of brainstorming across different Pull Requests. I would suggest you have a look at different issues which would help you gain familiarity with Spark and help to propose a project proposal. Hope that helps. was (Author: mechcoder): [~Manglano] [~leckie-chn] Hi, I am actually not a mentor but a student whom this GSoC project is preassigned to by Xiangrui (since I've been working on the Spark codebase for about a couple of months right now) . This project idea was actually a result of brainstorming across different Pull Requests. I would suggest you have a look at different issues which would help you gain familiarity with the API and help to propose a project proposal. Hope that helps. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6193) Speed up how spark-ec2 searches for clusters
[ https://issues.apache.org/jira/browse/SPARK-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6193: - Assignee: Nicholas Chammas Speed up how spark-ec2 searches for clusters Key: SPARK-6193 URL: https://issues.apache.org/jira/browse/SPARK-6193 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.4.0 {{spark-ec2}} currently pulls down [info for all instances|https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620] and searches locally for the target cluster. Instead, it should push those filters up when querying EC2. For AWS accounts with hundreds of active instances, there is a difference of many seconds between the two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6193) Speed up how spark-ec2 searches for clusters
[ https://issues.apache.org/jira/browse/SPARK-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6193. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4922 [https://github.com/apache/spark/pull/4922] Speed up how spark-ec2 searches for clusters Key: SPARK-6193 URL: https://issues.apache.org/jira/browse/SPARK-6193 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.4.0 {{spark-ec2}} currently pulls down [info for all instances|https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620] and searches locally for the target cluster. Instead, it should push those filters up when querying EC2. For AWS accounts with hundreds of active instances, there is a difference of many seconds between the two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
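For context, "pushing those filters up" means letting EC2 do the matching server-side instead of fetching every instance. A rough boto sketch of that idea follows, assuming the cluster is identified by its {{cluster_name-master}} and {{cluster_name-slaves}} security groups as spark-ec2 does; the helper is illustrative, not the code from the merged pull request.
{code}
# Sketch: ask EC2 only for the instances in the named cluster's security
# groups, rather than fetching every instance and filtering locally.
import boto.ec2

def get_cluster_instances(region, cluster_name):
    conn = boto.ec2.connect_to_region(region)
    reservations = conn.get_all_instances(filters={
        "instance.group-name": [cluster_name + "-master",
                                cluster_name + "-slaves"],
        "instance-state-name": ["pending", "running"],
    })
    return [inst for r in reservations for inst in r.instances]

# Only matching instances come back, so accounts with hundreds of unrelated
# instances no longer require a full client-side scan.
instances = get_cluster_instances("us-east-1", "my-spark-cluster")
{code}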
[jira] [Updated] (SPARK-6215) Shorten apply and update funcs in GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6215: - Component/s: SQL (Assign component to new JIRAs please) Shorten apply and update funcs in GenerateProjection Key: SPARK-6215 URL: https://issues.apache.org/jira/browse/SPARK-6215 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Some codes in GenerateProjection look redundant and can be shortened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4496) smallint (16 bit value) is being send as a 32 bit value in the thrift interface.
[ https://issues.apache.org/jira/browse/SPARK-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352046#comment-14352046 ] Sean Owen commented on SPARK-4496: -- Can you add any detail to this? where, and what is the manifestation of the problem? is it a bug or just suboptimal? smallint (16 bit value) is being send as a 32 bit value in the thrift interface. --- Key: SPARK-4496 URL: https://issues.apache.org/jira/browse/SPARK-4496 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.1.0 Reporter: Chip Sands -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3742) Link to Spark UI sometimes fails when using H/A RM's
[ https://issues.apache.org/jira/browse/SPARK-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3742. -- Resolution: Duplicate Link to Spark UI sometimes fails when using H/A RM's Key: SPARK-3742 URL: https://issues.apache.org/jira/browse/SPARK-3742 Project: Spark Issue Type: Bug Components: YARN Reporter: meiyoula When running an application on yarn, the hyperlink on yarn page can't jump to sparkUI page. It happens sometimes. The error message is: This is standby RM. Redirecting to the current active RM: http://vm-181:8088/proxy/application_1409206382122_0037 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6183) Skip bad workers when re-launching executors
[ https://issues.apache.org/jira/browse/SPARK-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352048#comment-14352048 ] Sean Owen commented on SPARK-6183: -- Isn't this a duplicate of https://issues.apache.org/jira/browse/SPARK-4609 ? Skip bad workers when re-launching executors Key: SPARK-6183 URL: https://issues.apache.org/jira/browse/SPARK-6183 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Peng Zhen In a standalone cluster, when an executor launch fails, the master should avoid re-launching it on the same worker. According to the current scheduling logic, the failed executor will very likely be re-launched on the same worker, eventually causing the application to be removed from the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-896) ADD_JARS does not add all classes to classpath in the spark-shell for cluster on Mesos.
[ https://issues.apache.org/jira/browse/SPARK-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-896. - Resolution: Won't Fix I'm gonna call this WontFix as ADD_JARS has been deprecated for a while. ADD_JARS does not add all classes to classpath in the spark-shell for cluster on Mesos. --- Key: SPARK-896 URL: https://issues.apache.org/jira/browse/SPARK-896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Gary Malouf I do not believe the issue is limited to scheduler/executors running on Mesos but added the information for debugging purposes. h3. Reproducing the issue: # Implement some custom functionalities and package them into a 'monster jar' with something like sbt assembly. # Drop this jar onto the Spark master box and specify the path to it in the ADD_JARS variable. # Start up the spark shell on same box as the master. You should be able to import packages/classes specified in the jar without any compilation trouble. # In a map function on an RDD, trying to call a class from within this jar (with fully qualified name) fails on a ClassNotFoundException. h3. Workaround Matei Zaharia suggested adding this jar to the SPARK_CLASSPATH environment variable - that resolved the issue. My understanding however is that the functionality should work using solely the ADD_JARS variable - the documentation does not capture this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6205) UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6205. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4933 [https://github.com/apache/spark/pull/4933] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError --- Key: SPARK-6205 URL: https://issues.apache.org/jira/browse/SPARK-6205 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 {code} mvn -DskipTests -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 clean install mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 test -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite -Dtest=none -pl core/ {code} will produce: {code} UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... {code} It doesn't seem to happen without the various profiles set above. The fix is simple, although sounds weird; Selenium's dependency on {{xml-apis:xml-apis}} must be manually included in core's test dependencies. This probably has something to do with Hadoop 2 vs 1 dependency changes and the fact that Maven test deps aren't transitive, AFAIK. PR coming... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352374#comment-14352374 ] Shivaram Venkataraman commented on SPARK-5134: -- Yeah if you exclude Spark's Hadoop dependency things work correctly for Hadoop1. There are some additional issues that come up in 1.2 if due to the Guava changes, but those are not related to the default Hadoop version change. I think the documentation to update would be [1] but I am thinking it would be good to mention this in the Quick Start guide [2] as well [1] https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/hadoop-third-party-distributions.md#linking-applications-to-the-hadoop-version [2] https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/quick-start.md#self-contained-applications Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6219) Expand Python lint checks to check for compilation errors
Nicholas Chammas created SPARK-6219: --- Summary: Expand Python lint checks to check for compilation errors Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6211) Test Python Kafka API using Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352425#comment-14352425 ] Saisai Shao commented on SPARK-6211: Thanks [~tdas] for your suggestion. Let me first understand how the Python unit tests work, and then figure out how to add a unit test in Python. Test Python Kafka API using Python unit tests - Key: SPARK-6211 URL: https://issues.apache.org/jira/browse/SPARK-6211 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical This is tricky in python because the KafkaStreamSuiteBase (which has the functionality of creating embedded kafka clusters) is in the test package, which is not in the python path. To fix that, we have two ways. 1. Add test jar to classpath in python test. That's kind of trickier. 2. Bring that into the src package (maybe renamed as KafkaTestUtils), and then wrap that in python to use it from python. If (2) does not add any extra test dependencies to the main Kafka pom, then 2 should be simpler to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
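A very rough sketch of what option (2) from the description could look like from the Python side once a {{KafkaTestUtils}} helper lives in the main source tree. The class name and its methods ({{setup}}, {{createTopic}}, {{sendMessages}}, {{teardown}}) are assumptions for illustration only, not an existing API.
{code}
# Hypothetical sketch: drive a JVM-side KafkaTestUtils helper from a Python
# unit test through the Py4J gateway that PySpark already exposes.
# All KafkaTestUtils method names below are assumed, not the current API.
import unittest
from pyspark import SparkContext

class KafkaStreamTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local[2]", "kafka-python-test")
        jvm = self.sc._jvm  # Py4J view of the JVM
        self.kafka_utils = jvm.org.apache.spark.streaming.kafka.KafkaTestUtils()
        self.kafka_utils.setup()  # start an embedded ZooKeeper + Kafka broker

    def tearDown(self):
        self.kafka_utils.teardown()
        self.sc.stop()

    def test_round_trip(self):
        self.kafka_utils.createTopic("topic1")
        self.kafka_utils.sendMessages("topic1", ["a", "b", "c"])
        # ... build a Python Kafka DStream against the embedded broker and
        # assert that the messages come back ...
{code}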
[jira] [Resolved] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.
[ https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3287. -- Resolution: Not a Problem Last update on the PR a while ago says that this was likely already fixed. When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed. Key: SPARK-3287 URL: https://issues.apache.org/jira/browse/SPARK-3287 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3287.patch When ResourceManager High Availability is enabled, there will be multiple resource managers and each of them could act as a proxy. AmIpFilter is modified to accept multiple proxy hosts. But Spark ApplicationMaster fails to read the ResourceManager IPs properly from the configuration. So AmIpFilter is initialized with an empty set of proxy hosts. So any access to the ApplicationMaster WebUI will be redirected to port RM port on the local host. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352102#comment-14352102 ] Shixiong Zhu commented on SPARK-5124: - The problem is that the message may come from caller.receive, but the callee wants to send the reply to caller.receiveAndReply. However, I cannot find a use case now. But I find some RpcEndpoint may need to know the sender's address. So I added the sender method to RpcCallContext. And I also removed replyWithSender since it can be replaced with RpcCallContext.sender.sendWithReply(msg, self) now. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3457) ConcurrentModificationException starting up pyspark
[ https://issues.apache.org/jira/browse/SPARK-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3457. -- Resolution: Duplicate Given that this also concerns accessing the system {{Properties}} object, it's the same as SPARK-4952 I'm sure. ConcurrentModificationException starting up pyspark --- Key: SPARK-3457 URL: https://issues.apache.org/jira/browse/SPARK-3457 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Hadoop 2.3 (CDH 5.1) on Ubuntu precise Reporter: Shay Rojansky Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in yarn-client mode (no additional params or anything), I got the exception below. Rerunning pyspark 5 times afterwards did not reproduce the issue. {code} 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1410275267606 yarnAppState: RUNNING 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master. grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011, /proxy/application_1410268447887_0011 Traceback (most recent call last): File /opt/spark/python/pyspark/shell.py, line 44, in module 14/09/09 18:07:58 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter sc = SparkContext(appName=PySparkShell, pyFiles=add_files) File /opt/spark/python/pyspark/context.py, line 107, in __init__ conf) File /opt/spark/python/pyspark/context.py, line 155, in _do_init self._jsc = self._initialize_context(self._conf._jconf) File /opt/spark/python/pyspark/context.py, line 201, in _initialize_context return self._jvm.JavaSparkContext(jconf) File /opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 701, in __call__ File /opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. 
: java.util.ConcurrentModificationException at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) at scala.collection.Iterator$class.toStream(Iterator.scala:1143) at scala.collection.AbstractIterator.toStream(Iterator.scala:1157) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream.length(Stream.scala:284) at scala.collection.SeqLike$class.sorted(SeqLike.scala:608) at scala.collection.AbstractSeq.sorted(Seq.scala:40) at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324) at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297) at org.apache.spark.SparkContext.init(SparkContext.scala:334) at org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:214) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA
[jira] [Resolved] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2541. -- Resolution: Duplicate Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In spark 0.9.x you could access secure HDFS from Standalone deploy, that doesn't work in 1.X anymore. It looks like the issues is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if the currentUser == user. Not sure how it affects when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.
[ https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2572: - Component/s: (was: Spark Core) Mesos Priority: Minor (was: Major) I wonder if this is still an issue, since we've since had a number of improvements to cleaning up the executors' work dirs, which might affect Mesos. Can't delete local dir on executor automatically when running spark over Mesos. --- Key: SPARK-2572 URL: https://issues.apache.org/jira/browse/SPARK-2572 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Yadong Qi Priority: Minor When running Spark over Mesos in “fine-grained” or “coarse-grained” mode, the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not deleted automatically after the application finishes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided
[ https://issues.apache.org/jira/browse/SPARK-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1985: - Component/s: (was: Spark Core) Mesos Labels: (was: mesos) The code in question at that point in time was:
{code}
val sparkHome = sc.getSparkHome().getOrElse(throw new SparkException(
  "Spark home is not set; set it through the spark.home system " +
  "property, the SPARK_HOME environment variable or the SparkContext constructor"))
{code}
and it's now
{code}
val executorSparkHome = sc.conf.getOption("spark.mesos.executor.home")
  .orElse(sc.getSparkHome()) // Fall back to driver Spark home for backward compatibility
  .getOrElse {
    throw new SparkException("Executor Spark home `spark.mesos.executor.home` is not set!")
  }
{code}
So {{SPARK_HOME}} / {{spark.home}} are no longer required, although they've just been replaced with another more specific value in SPARK-3264 / https://github.com/apache/spark/commit/41dc5987d9abeca6fc0f5935c780d48f517cdf95 Although the assembly is automatically added to the classpath by {{compute-classpath.sh}} too, that may not be 100% of what this is asking, which is to be able to not set a 'home' at all. My read of SPARK-3264 however is that we should have an explicit 'home' setting for Mesos executors. Or else I'm not clear how you find `bin/spark-class` for example (see the relevant change in https://github.com/apache/spark/commit/4a4f9ccba2b42b64356db7f94ed9019212fc7317 too) SPARK_HOME shouldn't be required when spark.executor.uri is provided Key: SPARK-1985 URL: https://issues.apache.org/jira/browse/SPARK-1985 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Environment: MESOS Reporter: Gerard Maas When trying to run that simple example on a Mesos installation, I get an error that SPARK_HOME is not set. A local spark installation should not be required to run a job on Mesos. All that's needed is the executor package, being the assembly.tar.gz on a reachable location (HDFS/S3/HTTP). I went looking into the code and indeed there's a check on SPARK_HOME [2] regardless of the presence of the assembly but it's actually only used if the assembly is not provided (which is a kind-of best-effort recovery strategy). Current flow: if (!SPARK_HOME) fail("No SPARK_HOME") else if (assembly) { use assembly } else { try use SPARK_HOME to build spark_executor } Should be: sparkExecutor = if (assembly) { assembly } else if (SPARK_HOME) { try use SPARK_HOME to build spark_executor } else { fail("No executor found. Please provide spark.executor.uri (preferred) or spark.home") } [1] http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
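As a rough illustration of the fallback order the description asks for (prefer the prepackaged executor URI, fall back to a configured Spark home, and only fail when neither is set), the selection logic could look something like the sketch below. This is not the MesosSchedulerBackend implementation; the method name and error message are invented for illustration.

{code}
import org.apache.spark.{SparkConf, SparkException}

object ExecutorLocationSketch {
  // Returns Left(uri) when a prepackaged executor (spark.executor.uri) should
  // be fetched from HDFS/S3/HTTP, or Right(home) when the executor command
  // must be built from a local Spark installation.
  def executorLocation(conf: SparkConf): Either[String, String] = {
    conf.getOption("spark.executor.uri") match {
      case Some(uri) => Left(uri)
      case None =>
        val home = conf.getOption("spark.mesos.executor.home")
          .orElse(conf.getOption("spark.home"))
          .getOrElse(throw new SparkException(
            "No executor found. Please provide spark.executor.uri (preferred) " +
              "or spark.mesos.executor.home / spark.home"))
        Right(home)
    }
  }
}
{code}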
[jira] [Resolved] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided
[ https://issues.apache.org/jira/browse/SPARK-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1985. -- Resolution: Not a Problem SPARK_HOME shouldn't be required when spark.executor.uri is provided Key: SPARK-1985 URL: https://issues.apache.org/jira/browse/SPARK-1985 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Environment: MESOS Reporter: Gerard Maas When trying to run that simple example on a Mesos installation, I get an error that SPARK_HOME is not set. A local spark installation should not be required to run a job on Mesos. All that's needed is the executor package, being the assembly.tar.gz on a reachable location (HDFS/S3/HTTP). I went looking into the code and indeed there's a check on SPARK_HOME [2] regardless of the presence of the assembly but it's actually only used if the assembly is not provided (which is a kind-of best-effort recovery strategy). Current flow: if (!SPARK_HOME) fail("No SPARK_HOME") else if (assembly) { use assembly } else { try use SPARK_HOME to build spark_executor } Should be: sparkExecutor = if (assembly) { assembly } else if (SPARK_HOME) { try use SPARK_HOME to build spark_executor } else { fail("No executor found. Please provide spark.executor.uri (preferred) or spark.home") } [1] http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352068#comment-14352068 ] Sean Owen commented on SPARK-3685: -- This is 90% the same discussion as SPARK-1529, although this concerns making the current behavior more explicit (e.g. fail on an hdfs: URI) whereas SPARK-1529 (and the discussion below) discusses making other FS schemes work. I'd like to potentially address this issue without prejudicing SPARK-1529. In fact this discussion usefully contains a good use case for putting a local dir on distributed storage, whereas I personally don't see it in the arguments in SPARK-1529. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Utils#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
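To make the failure mode concrete: java.io.File has no notion of URI schemes, so a value like hdfs:/tmp/foo is treated as a relative path whose first component is literally named "hdfs:". The sketch below shows that behaviour, plus one possible up-front scheme check; the check is an assumption for illustration, not Spark's actual validation code.

{code}
import java.io.File
import java.net.URI

object LocalDirCheckSketch {
  def main(args: Array[String]): Unit = {
    // java.io.File just sees an ordinary relative path here, so the absolute
    // path becomes <cwd>/hdfs:/tmp/foo, i.e. a directory literally named "hdfs:".
    val f = new File("hdfs:/tmp/foo")
    println(f.getAbsolutePath)

    println(isLocalPath("hdfs:/tmp/foo")) // false -> could fail fast instead of silently
    println(isLocalPath("/tmp/foo"))      // true
    println(isLocalPath("file:/tmp/foo")) // true
  }

  // One possible up-front check: accept only paths with no scheme or an
  // explicit file: scheme.
  def isLocalPath(path: String): Boolean = {
    val scheme = new URI(path).getScheme
    scheme == null || scheme == "file"
  }
}
{code}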
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352101#comment-14352101 ] Shixiong Zhu commented on SPARK-5124: - The problem is that the message may come from caller.receive, but the callee wants to send the reply to caller.receiveAndReply. However, I cannot find a use case for that now. But I found that some RpcEndpoints may need to know the sender's address, so I added a sender method to RpcCallContext. I also removed replyWithSender, since it can now be replaced with RpcCallContext.sender.sendWithReply(msg, self). Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we could standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
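For readers following along, the shape of the interface being discussed is roughly as follows. The names RpcCallContext, sender, and sendWithReply are taken from the comment above; everything else is assumed for illustration, and this is not the API that was eventually merged.

{code}
import scala.concurrent.Future

// Reference to a remote (or local) endpoint.
trait RpcEndpointRef {
  def send(message: Any): Unit
  def sendWithReply[T](message: Any, sender: RpcEndpointRef): Future[T]
}

// Context handed to an endpoint while it processes a request.
trait RpcCallContext {
  // Reply to the message currently being processed.
  def reply(response: Any): Unit
  // Whoever sent the current message, so the endpoint can initiate a new
  // request back to it (making a separate replyWithSender helper unnecessary).
  def sender: RpcEndpointRef
}

trait RpcEndpoint {
  def self: RpcEndpointRef
  // Fire-and-forget messages.
  def receive: PartialFunction[Any, Unit]
  // Messages that expect an answer; the context carries reply() and sender.
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
}
{code}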
[jira] [Issue Comment Deleted] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-5124: Comment: was deleted (was: The problem is that the message may come from caller.receive, but the callee wants to send the reply to caller.receiveAndReply. However, I cannot find a use case now. But I find some RpcEndpoint may need to know the sender's address. So I added the sender method to RpcCallContext. And I also removed replyWithSender since it can be replaced with RpcCallContext.sender.sendWithReply(msg, self) now.) Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1444) Update branch-0.9's SBT to 0.13.1 so that it works with Java 8
[ https://issues.apache.org/jira/browse/SPARK-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1444. -- Resolution: Won't Fix Target Version/s: (was: 0.9.3) I suggest we call this WontFix, as 0.9 is now 4 minor releases behind, SBT isn't the primary or only build, and the straightforward way to address this does not seem to work. Update branch-0.9's SBT to 0.13.1 so that it works with Java 8 -- Key: SPARK-1444 URL: https://issues.apache.org/jira/browse/SPARK-1444 Project: Spark Issue Type: Bug Components: Build Reporter: Matei Zaharia Apparently the older versions have problems if you compile on Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2326) DiskBlockManager could add DiskChecker function for kicking off bad directories
[ https://issues.apache.org/jira/browse/SPARK-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2326. -- Resolution: Duplicate Essentially the same idea, that {{DiskStore}} / {{BlockManager}} could blacklist bad directories. DiskBlockManager could add DiskChecker function for kicking off bad directories --- Key: SPARK-2326 URL: https://issues.apache.org/jira/browse/SPARK-2326 Project: Spark Issue Type: Bug Components: Block Manager Reporter: YanTang Zhai If a disk failure happens while the Spark cluster is running, DiskBlockManager should kick off bad directories automatically. DiskBlockManager could add a DiskChecker function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
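A minimal sketch of the kind of check being proposed, in the spirit of Hadoop's DiskChecker: probe each configured directory and keep only the ones in which files can still be created and deleted. This is illustrative only, not Spark's DiskBlockManager code.

{code}
import java.io.File
import java.util.UUID

object DiskCheckSketch {
  // Filter a set of configured local directories down to the healthy ones.
  def healthyDirs(dirs: Seq[File]): Seq[File] = dirs.filter(isHealthy)

  // A directory is considered healthy if it exists (or can be created) and a
  // probe file can be created and deleted inside it.
  def isHealthy(dir: File): Boolean = {
    try {
      if (!dir.isDirectory && !dir.mkdirs()) return false
      val probe = new File(dir, s".diskcheck-${UUID.randomUUID()}")
      val created = probe.createNewFile()
      val deleted = probe.delete()
      created && deleted
    } catch {
      case _: java.io.IOException => false
      case _: SecurityException => false
    }
  }
}
{code}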
[jira] [Updated] (SPARK-4450) SparkSQL producing incorrect answer when using --master yarn
[ https://issues.apache.org/jira/browse/SPARK-4450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4450: - Component/s: (was: Spark Core) SQL SparkSQL producing incorrect answer when using --master yarn Key: SPARK-4450 URL: https://issues.apache.org/jira/browse/SPARK-4450 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: CDH 5.1 Reporter: Rick Bischoff A simple summary program using spark-submit --master local MyJob.py vs. spark-submit --master yarn MyJob.py produces different answers: the output produced by local has been independently verified and is correct, but the output from yarn is incorrect. It does not appear to happen with smaller files, only large files. MyJob.py is
{code}
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

def maybeFloat(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0.
    else:
        return float(x)

def maybeInt(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0
    else:
        return int(x)

def mapColl(p):
    return {"f1": p[0], "f2": p[1], "f3": p[2], "f4": int(p[3]), "f5": int(p[4]),
            "f6": p[5], "f7": p[6], "f8": p[7], "f9": p[8], "f10": maybeInt(p[9]),
            "f11": p[10], "f12": p[11], "f13": p[12], "f14": p[13],
            "f15": maybeFloat(p[14]), "f16": maybeInt(p[15]), "f17": maybeFloat(p[16])}

sc = SparkContext()
sqlContext = SQLContext(sc)

lines = sc.textFile("sample.csv")
fields = lines.map(lambda l: mapColl(l.split(",")))
collTable = sqlContext.inferSchema(fields)
collTable.registerAsTable("sample")

test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
                      + "FROM sample " \
                      + "GROUP BY f9")
foo = test.collect()
print foo

sc.stop()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4876) An exception thrown when accessing a Spark SQL table using a JDBC driver from a standalone app.
[ https://issues.apache.org/jira/browse/SPARK-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4876. -- Resolution: Not a Problem Agree that the problem is that the metastore config is not pointing to HDFS as it should. It's looking at a local path, not an hdfs: path. An exception thrown when accessing a Spark SQL table using a JDBC driver from a standalone app. --- Key: SPARK-4876 URL: https://issues.apache.org/jira/browse/SPARK-4876 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.1 Environment: Mac OS X 10.10.1, Apache Spark 1.1.1, Reporter: Leonid Mikhailov I am running Spark version 1.1.1 (built on Mac using: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package). I start the JDBC server like this: ./sbin/start-thriftserver.sh In my IDE I am running the following example:
{code:title=TestSparkSQLJdbcAccess.java|borderStyle=solid}
package com.bla.spark.sql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TestSparkSQLJdbcAccess {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
      System.exit(1);
    }
    // replace hive here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.execute("drop table if exists " + tableName);
    stmt.execute("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    stmt.execute(sql);
    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}
{code}
The pom.xml is as follows:
{code:xml}
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.esri.spark</groupId>
  <artifactId>HiveJDBCTest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>HiveJDBCTest</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>0.12.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>
  </dependencies>
</project>
{code}
I am getting an exception:
{noformat}
Exception in thread "main" java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/testhivedrivertable is not a directory or unable to create one)
 at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:165)
 at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:153)
 at
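Given the resolution above, the fix on the user's side would be to make the metastore warehouse location point at HDFS rather than the local default that appears in the error. One way to do that, assuming an HDFS namenode at a placeholder host and port, is a hive-site.xml on Spark's conf path, for example:

{code:xml}
<!-- conf/hive-site.xml: point the Hive warehouse at HDFS instead of the
     default local file:/user/hive/warehouse; host and port are placeholders. -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://namenode-host:8020/user/hive/warehouse</value>
  </property>
</configuration>
{code}

With that in place, tables such as testHiveDriverTable would be created under the HDFS warehouse directory instead of a non-existent local path.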
[jira] [Updated] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6208: Priority: Minor (was: Major) executor-memory does not work when using local cluster -- Key: SPARK-6208 URL: https://issues.apache.org/jira/browse/SPARK-6208 Project: Spark Issue Type: New Feature Components: Spark Submit Reporter: Yin Huai Priority: Minor It seems the executor memory set with a local cluster is not correctly applied (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). Also, totalExecutorCores seems to have the same issue (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
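For illustration, with a local-cluster master the per-worker memory is encoded in the master string itself (number of workers, cores per worker, memory per worker in MB), and spark.executor.memory can be set directly on the SparkConf. Whether that sidesteps the SparkSubmit code path referenced above is an assumption here, not something verified in this issue.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object LocalClusterMemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // 2 workers, 1 core each, 1024 MB per worker, all encoded in the master string.
      .setMaster("local-cluster[2,1,1024]")
      .setAppName("local-cluster-memory-sketch")
      // Set executor memory on the conf rather than relying on --executor-memory.
      .set("spark.executor.memory", "512m")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}
{code}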