[jira] [Resolved] (SPARK-8793) error/warning with pyspark WholeTextFiles.first

2015-09-09 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll resolved SPARK-8793.
--
Resolution: Not A Problem

This is no longer occurring.

> error/warning with pyspark WholeTextFiles.first
> ---
>
> Key: SPARK-8793
> URL: https://issues.apache.org/jira/browse/SPARK-8793
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Diana Carroll
>Priority: Minor
> Attachments: wholefilesbug.txt
>
>
> In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working 
> correctly in pyspark.  It works fine in Scala.
> I created a directory with two tiny, simple text files.  
> this works:
> {code}sc.wholeTextFiles("testdata").collect(){code}
> this doesn't:
> {code}sc.wholeTextFiles("testdata").first(){code}
> The main error message is:
> {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in 
> stage 12.0 (TID 12)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
> process()
>   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
> while taken < left:
> ImportError: No module named iter
> {code}
> I will attach the full stack trace to the JIRA.
> I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0).  Tested in both Python 2.6 
> and 2.7, same results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9821) pyspark reduceByKey should allow a custom partitioner

2015-08-11 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-9821:


 Summary: pyspark reduceByKey should allow a custom partitioner
 Key: SPARK-9821
 URL: https://issues.apache.org/jira/browse/SPARK-9821
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Diana Carroll


In Scala, I can supply a custom partitioner to reduceByKey (and other 
aggregation/repartitioning methods like aggregateByKey and combineByKey), but 
as far as I can tell from the PySpark API, there's no way to do the same in 
Python.

Here's an example of my code in Scala:

{code}weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _+_){code}

But I can't figure out how to do the same in Python.  The closest I can get is 
to call partitionBy before reduceByKey like so:
{code}weblogs.map(lambda s: (getFileType(s), 
1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: 
v1+v2).collect(){code}

But that defeats the purpose, because I'm shuffling twice instead of once, so 
my performance is worse instead of better.
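
For what it's worth, newer PySpark releases expose an optional partitionFunc 
keyword on reduceByKey (and the other *ByKey methods). A minimal sketch assuming 
that keyword is available in your version, reusing the getFileType and 
hash_filetype helpers from above, so the aggregation shuffles only once:

{code}
# Sketch only: assumes a PySpark version whose reduceByKey accepts partitionFunc;
# getFileType and hash_filetype are the user-defined helpers referenced above.
counts = (weblogs
          .map(lambda s: (getFileType(s), 1))
          .reduceByKey(lambda v1, v2: v1 + v2,
                       numPartitions=3,
                       partitionFunc=hash_filetype))
counts.collect()
{code}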




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8793) error/warning with pyspark WholeTextFiles.first

2015-07-02 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-8793:


 Summary: error/warning with pyspark WholeTextFiles.first
 Key: SPARK-8793
 URL: https://issues.apache.org/jira/browse/SPARK-8793
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Diana Carroll
Priority: Minor
 Attachments: wholefilesbug.txt

In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working 
correctly in pyspark.  It works fine in Scala.

I created a directory with two tiny, simple text files.  
this works:
{code}sc.wholeTextFiles("testdata").collect(){code}
this doesn't:
{code}sc.wholeTextFiles("testdata").first(){code}

The main error message is:
{code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 
12.0 (TID 12)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
    while taken < left:
ImportError: No module named iter
{code}
I will attach the full stack trace to the JIRA.

I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0).  Tested in both Python 2.6 
and 2.7, same results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8793) error/warning with pyspark WholeTextFiles.first

2015-07-02 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-8793:
-
Attachment: wholefilesbug.txt

 error/warning with pyspark WholeTextFiles.first
 ---

 Key: SPARK-8793
 URL: https://issues.apache.org/jira/browse/SPARK-8793
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Diana Carroll
Priority: Minor
 Attachments: wholefilesbug.txt


 In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working 
 correctly in pyspark.  It works fine in Scala.
 I created a directory with two tiny, simple text files.  
 this works:
 {code}sc.wholeTextFiles("testdata").collect(){code}
 this doesn't:
 {code}sc.wholeTextFiles("testdata").first(){code}
 The main error message is:
 {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in 
 stage 12.0 (TID 12)
 org.apache.spark.api.python.PythonException: Traceback (most recent call 
 last):
   File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
     process()
   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
     vs = list(itertools.islice(iterator, batch))
   File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
     while taken < left:
 ImportError: No module named iter
 {code}
 I will attach the full stack trace to the JIRA.
 I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0).  Tested in both Python 2.6 
 and 2.7, same results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8795) pySpark wholeTextFiles error when mapping string

2015-07-02 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll closed SPARK-8795.

Resolution: Not A Problem

User configuration problem: error resulted from mismatched versions of Python 
on driver and executor.
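
For anyone who hits the same error, one way to confirm the mismatch is to 
compare the driver's interpreter version with what the executors report, 
pinning PYSPARK_PYTHON before the SparkContext is created. A rough sketch; the 
interpreter path is an assumption for your environment:

{code}
import os, sys

# Assumption: /usr/bin/python2.7 exists on every node. PYSPARK_PYTHON should be
# set before the SparkContext is created so executors launch their Python
# workers with the same interpreter as the driver.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("version-check"))

driver_version = sys.version_info[:2]
executor_version = sc.parallelize([0], 1) \
    .map(lambda _: __import__("sys").version_info[:2]).first()
print(driver_version, executor_version)  # the two tuples should match
{code}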

 pySpark wholeTextFiles error when mapping string
 

 Key: SPARK-8795
 URL: https://issues.apache.org/jira/browse/SPARK-8795
 Project: Spark
  Issue Type: Bug
  Components: PySpark
 Environment: CentOS 6.6, Python 2.7, CDH 5.4.1
Reporter: Diana Carroll
 Attachments: wholefilesbug.txt


 I created a test directory with two tiny text files.
 This call works:
 {code}sc.wholeTextFiles("testdata").map(lambda (fname,x): len(x)).collect(){code}
 This call does not:
 {code}sc.wholeTextFiles("testdata").map(lambda (fname,x): x.islower()).collect(){code}
 In fact, any attempt to call any string method on X, or to pass X to any 
 function requiring a string, fails the same way.
 The main error is
 {code}  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
     process()
   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
     vs = list(itertools.islice(iterator, batch))
   File "<ipython-input-107-5192d18d0e4c>", line 1, in <lambda>
 TypeError: 'bool' object is not callable
 {code}
 Will attach full log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8795) pySpark wholeTextFiles error when mapping string

2015-07-02 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-8795:
-
Attachment: wholefilesbug.txt

 pySpark wholeTextFiles error when mapping string
 

 Key: SPARK-8795
 URL: https://issues.apache.org/jira/browse/SPARK-8795
 Project: Spark
  Issue Type: Bug
  Components: PySpark
 Environment: CentOS 6.6, Python 2.7, CDH 5.4.1
Reporter: Diana Carroll
 Attachments: wholefilesbug.txt


 I created a test directory with two tiny text files.
 This call works:
 {code}sc.wholeTextFiles("testdata").map(lambda (fname,x): len(x)).collect(){code}
 This call does not:
 {code}sc.wholeTextFiles("testdata").map(lambda (fname,x): x.islower()).collect(){code}
 In fact, any attempt to call any string method on X, or to pass X to any 
 function requiring a string, fails the same way.
 The main error is
 {code}  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
     process()
   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
     vs = list(itertools.islice(iterator, batch))
   File "<ipython-input-107-5192d18d0e4c>", line 1, in <lambda>
 TypeError: 'bool' object is not callable
 {code}
 Will attach full log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8795) pySpark wholeTextFiles error when mapping string

2015-07-02 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-8795:


 Summary: pySpark wholeTextFiles error when mapping string
 Key: SPARK-8795
 URL: https://issues.apache.org/jira/browse/SPARK-8795
 Project: Spark
  Issue Type: Bug
  Components: PySpark
 Environment: CentOS 6.6, Python 2.7, CDH 5.4.1
Reporter: Diana Carroll


I created a test directory with two tiny text files.

This call works:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): len(x)).collect(){code}

This call does not:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): x.islower()).collect(){code}

In fact, any attempt to call any string method on X, or to pass X to any function 
requiring a string, fails the same way.

The main error is
{code}  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-107-5192d18d0e4c>", line 1, in <lambda>
TypeError: 'bool' object is not callable
{code}

Will attach full log.
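
Side note: tuple-unpacking lambda parameters like {{lambda (fname,x)}} were 
removed in Python 3 (PEP 3113), which is one way a driver/executor version 
mismatch can surface. A version-agnostic rewrite of the failing call, for 
reference:

{code}
# Equivalent call without tuple unpacking in the lambda signature;
# valid on both Python 2 and Python 3.
sc.wholeTextFiles("testdata").map(lambda pair: pair[1].islower()).collect()
{code}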




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7607) Spark SQL prog guide code error

2015-05-13 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-7607:


 Summary: Spark SQL prog guide code error
 Key: SPARK-7607
 URL: https://issues.apache.org/jira/browse/SPARK-7607
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.3.1
Reporter: Diana Carroll
Priority: Minor


In the DataFrame Operations section, there's a bug in the filter example for 
all three languages.

The comments say we want to filter for people over age 21, but the code is
{code}
df.filter(df("name") > 21).show()
{code}
It's filtering for people whose NAME is over 21.
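
For reference, the intended filter is on the age column. A corrected PySpark 
form (the Scala fix is the analogous df("age") > 21):

{code}
# Intended example: select people older than 21.
df.filter(df["age"] > 21).show()
{code}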



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7607) Spark SQL prog guide code error

2015-05-13 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll closed SPARK-7607.

Resolution: Duplicate

Woops, duplicate of SPARK-6383 (already fixed)

 Spark SQL prog guide code error
 ---

 Key: SPARK-7607
 URL: https://issues.apache.org/jira/browse/SPARK-7607
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.3.1
Reporter: Diana Carroll
Priority: Minor

 In the DataFrame Operations section, there's a bug in the filter example 
 for all three languages.
 The comments say we want to filter for people over age 21, but the code is
 {code}
 df.filter(df("name") > 21).show()
 {code}
 It's filtering for people whose NAME is over 21.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting

2014-07-16 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-2527:


 Summary: incorrect persistence level shown in Spark UI after 
repersisting
 Key: SPARK-2527
 URL: https://issues.apache.org/jira/browse/SPARK-2527
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Diana Carroll


If I persist an RDD at one level, unpersist it, then repersist it at another 
level, the UI will continue to show the RDD at the first level...but correctly 
show individual partitions at the second level.

{code}
import org.apache.spark.api.java.StorageLevels._
val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}

after the first call to persist and count, the Spark App web UI shows:

RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Disk Serialized 1x Replicated

After the second call, it shows:

RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Memory Deserialized 1x Replicated
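
A useful cross-check outside the web UI is to ask the RDD itself for its level. 
A PySpark sketch of the same sequence (Scala's test1.getStorageLevel is the 
equivalent call):

{code}
# getStorageLevel() queries the RDD directly rather than the web UI,
# so it should reflect the level set by the second persist().
from pyspark import StorageLevel

test1 = sc.parallelize([1, 2, 3])
test1.persist(StorageLevel.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevel.MEMORY_ONLY)
test1.count()
print(test1.getStorageLevel())
{code}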



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting

2014-07-16 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-2527:
-

Description: 
If I persist an RDD at one level, unpersist it, then repersist it at another 
level, the UI will continue to show the RDD at the first level...but correctly 
show individual partitions at the second level.

{code}
import org.apache.spark.api.java.StorageLevels
import org.apache.spark.api.java.StorageLevels._
val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}

after the first call to persist and count, the Spark App web UI shows:

RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Disk Serialized 1x Replicated

After the second call, it shows:

RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Memory Deserialized 1x Replicated

  was:
If I persist an RDD at one level, unpersist it, then repersist it at another 
level, the UI will continue to show the RDD at the first level...but correctly 
show individual partitions at the second level.

{code}
import org.apache.spark.api.java.StorageLevels._
val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}

after the first call to persist and count, the Spark App web UI shows:

RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Disk Serialized 1x Replicated

After the second call, it shows:

RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Memory Deserialized 1x Replicated


 incorrect persistence level shown in Spark UI after repersisting
 

 Key: SPARK-2527
 URL: https://issues.apache.org/jira/browse/SPARK-2527
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Diana Carroll
 Attachments: persistbug1.png, persistbug2.png


 If I persist an RDD at one level, unpersist it, then repersist it at another 
 level, the UI will continue to show the RDD at the first level...but 
 correctly show individual partitions at the second level.
 {code}
 import org.apache.spark.api.java.StorageLevels
 import org.apache.spark.api.java.StorageLevels._
 val test1 = sc.parallelize(Array(1,2,3))
 test1.persist(StorageLevels.DISK_ONLY)
 test1.count()
 test1.unpersist()
 test1.persist(StorageLevels.MEMORY_ONLY)
 test1.count()
 {code}
 after the first call to persist and count, the Spark App web UI shows:
 RDD Storage Info for 14
 Storage Level: Disk Serialized 1x Replicated
 rdd_14_0    Disk Serialized 1x Replicated
 After the second call, it shows:
 RDD Storage Info for 14
 Storage Level: Disk Serialized 1x Replicated
 rdd_14_0    Memory Deserialized 1x Replicated



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark

2014-07-01 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-2334:


 Summary: Attribute Error calling PipelinedRDD.id() in pyspark
 Key: SPARK-2334
 URL: https://issues.apache.org/jira/browse/SPARK-2334
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Diana Carroll


Calling the id() function of a PipelinedRDD causes an error in PySpark.  (Works 
fine in Scala.)

The second id() call here fails, the first works:
{code}
r1 = sc.parallelize([1,2,3])
r1.id()
r2=r1.map(lambda i: i+1)
r2.id()
{code}

Error:

{code}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-a0cf66fcf645> in <module>()
----> 1 r2.id()

/usr/lib/spark/python/pyspark/rdd.py in id(self)
    180         A unique ID for this RDD (within its SparkContext).
    181         """
--> 182         return self._id
    183 
    184     def __repr__(self):

AttributeError: 'PipelinedRDD' object has no attribute '_id'
{code}
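
Until this is fixed, one possible workaround leans on PySpark internals (the 
underscore-prefixed _jrdd attribute, so not a stable API) to ask the underlying 
JVM RDD for its id:

{code}
# Workaround sketch only -- relies on the private _jrdd attribute, which
# materializes the JVM-side RDD and exposes its id() method via Py4J.
r2 = sc.parallelize([1, 2, 3]).map(lambda i: i + 1)
print(r2._jrdd.id())
{code}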



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-2003) SparkContext(SparkConf) doesn't work in pyspark

2014-07-01 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll reopened SPARK-2003:
--


 SparkContext(SparkConf) doesn't work in pyspark
 ---

 Key: SPARK-2003
 URL: https://issues.apache.org/jira/browse/SPARK-2003
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark
Affects Versions: 1.0.0
Reporter: Diana Carroll
 Fix For: 1.0.1, 1.1.0


 Using SparkConf with SparkContext as described in the Programming Guide does 
 NOT work in Python:
 conf = SparkConf.setAppName("blah")
 sc = SparkContext(conf)
 When I tried I got 
 AttributeError: 'SparkConf' object has no attribute '_get_object_id'
 [This equivalent code in Scala works fine:
 val conf = new SparkConf().setAppName("blah")
 val sc = new SparkContext(conf)]
 I think this is because there's no equivalent for the Scala constructor 
 SparkContext(SparkConf).  
 Workaround:
 If I explicitly set the conf parameter in the python call, it does work:
 sconf = SparkConf.setAppName("blah")
 sc = SparkContext(conf=sconf)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2003) SparkContext(SparkConf) doesn't work in pyspark

2014-07-01 Thread Diana Carroll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049084#comment-14049084
 ] 

Diana Carroll commented on SPARK-2003:
--

Actually you can't create a SparkContext in the repl...it's already created
for you.  Therefore this bug is moot in the repl, and only applies to a
standalone pyspark program.

As far as I know, in Scala the officially recommended way to
programmatically configure a SparkContext is new SparkContext(sparkConf).
Therefore the same should be true for PySpark.


On Tue, Jul 1, 2014 at 12:40 PM, Matthew Farrellee (JIRA) j...@apache.org



 SparkContext(SparkConf) doesn't work in pyspark
 ---

 Key: SPARK-2003
 URL: https://issues.apache.org/jira/browse/SPARK-2003
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark
Affects Versions: 1.0.0
Reporter: Diana Carroll
 Fix For: 1.0.1, 1.1.0


 Using SparkConf with SparkContext as described in the Programming Guide does 
 NOT work in Python:
 conf = SparkConf.setAppName("blah")
 sc = SparkContext(conf)
 When I tried I got 
 AttributeError: 'SparkConf' object has no attribute '_get_object_id'
 [This equivalent code in Scala works fine:
 val conf = new SparkConf().setAppName("blah")
 val sc = new SparkContext(conf)]
 I think this is because there's no equivalent for the Scala constructor 
 SparkContext(SparkConf).  
 Workaround:
 If I explicitly set the conf parameter in the python call, it does work:
 sconf = SparkConf.setAppName("blah")
 sc = SparkContext(conf=sconf)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2003) SparkContext(SparkConf) doesn't work in pyspark

2014-06-30 Thread Diana Carroll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047710#comment-14047710
 ] 

Diana Carroll commented on SPARK-2003:
--

I do not agree this is a doc bug.  This is legit code that should work.  The 
python implementation of the SparkContext constructor should accept a SparkConf 
object just like the Scala implementation does.

{code}
from pyspark import SparkContext
from pyspark import SparkConf
if __name__ == "__main__":
    sconf = SparkConf().setAppName("My Spark App")
    sc = SparkContext(sconf)
    print "Hello world"
{code}

(setting master is not required in Spark 1.0+, as it can be set using 
spark-submit)
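
For completeness, the positional call fails because the first positional 
parameter of pyspark's SparkContext is master, so the SparkConf lands where a 
master URL string is expected. The keyword form works:

{code}
# Working form: pass the conf by keyword.
from pyspark import SparkConf, SparkContext

sconf = SparkConf().setAppName("My Spark App")
sc = SparkContext(conf=sconf)
{code}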


 SparkContext(SparkConf) doesn't work in pyspark
 ---

 Key: SPARK-2003
 URL: https://issues.apache.org/jira/browse/SPARK-2003
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark
Affects Versions: 1.0.0
Reporter: Diana Carroll
 Fix For: 1.0.1, 1.1.0


 Using SparkConf with SparkContext as described in the Programming Guide does 
 NOT work in Python:
 conf = SparkConf.setAppName("blah")
 sc = SparkContext(conf)
 When I tried I got 
 AttributeError: 'SparkConf' object has no attribute '_get_object_id'
 [This equivalent code in Scala works fine:
 val conf = new SparkConf().setAppName("blah")
 val sc = new SparkContext(conf)]
 I think this is because there's no equivalent for the Scala constructor 
 SparkContext(SparkConf).  
 Workaround:
 If I explicitly set the conf parameter in the python call, it does work:
 sconf = SparkConf.setAppName("blah")
 sc = SparkContext(conf=sconf)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985880#comment-13985880
 ] 

Diana Carroll commented on SPARK-823:
-

Yes, please clarify the documentation; I just ran into this.  The Configuration 
guide (http://spark.apache.org/docs/latest/configuration.html) says the default 
is 8.

In testing this on Standalone Spark, there actually is no default value for the 
variable:
sc.getConf.contains("spark.default.parallelism")
res1: Boolean = false

It looks like if the variable is not set, then the default behavior is decided 
in code, e.g. Partitioner.scala:
{code}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.size)
}
{code}
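
So the only way to get a predictable value across backends today is to set the 
property explicitly. A quick PySpark check (sc.defaultParallelism behaves the 
same way in Scala):

{code}
# Setting the property explicitly makes the default deterministic; without it,
# defaultParallelism falls back to backend-specific logic.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("parallelism-check") \
                  .set("spark.default.parallelism", "8")
sc = SparkContext(conf=conf)
print(sc.defaultParallelism)  # 8
{code}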

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 0.8.0, 0.7.3
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985934#comment-13985934
 ] 

Diana Carroll commented on SPARK-823:
-

Okay, this is definitely more than a documentation bug, because PySpark and 
Scala work differently if spark.default.parallelism isn't set by the user.  I'm 
testing using wordcount.

Pyspark: reduceByKey will use the value of sc.defaultParallelism.  That value 
is set to the number of threads when running locally.  On my Spark Standalone 
cluster which has a single node with a single core, the value is 2.  If I set 
spark.default.parallelism, it will set sc.defaultParallelism to that value and 
use that.

Scala: reduceByKey will use the number of partitions in my file/map stage and 
ignore the value of sc.defaultParallelism.  sc.defaultParallelism is set by the 
same logic as pyspark (number of threads for local, 2 for my microcluster); it 
is just ignored.

I'm not sure which approach is correct.  Scala works as described here: 
http://spark.apache.org/docs/latest/tuning.html
{quote}
Spark automatically sets the number of “map” tasks to run on each file 
according to its size (though you can control it through optional parameters to 
SparkContext.textFile, etc), and for distributed “reduce” operations, such as 
groupByKey and reduceByKey, it uses the largest parent RDD’s number of 
partitions. You can pass the level of parallelism as a second argument (see the 
spark.PairRDDFunctions documentation), or set the config property 
spark.default.parallelism to change the default. In general, we recommend 2-3 
tasks per CPU core in your cluster.
{quote}
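
Until the backends agree, the safest bet in PySpark is to pass the partition 
count to reduceByKey explicitly rather than relying on either default. A small 
wordcount sketch, where words is a hypothetical RDD of strings:

{code}
# Passing numPartitions explicitly sidesteps the backend-dependent default.
counts = (words
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b, numPartitions=8))
{code}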



 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 0.8.0, 0.7.3
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-823:


Component/s: PySpark

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Spark Core
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-823:


Affects Version/s: 0.9.1

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Spark Core
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1666) document examples

2014-04-29 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-1666:


 Summary: document examples
 Key: SPARK-1666
 URL: https://issues.apache.org/jira/browse/SPARK-1666
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.9.1
Reporter: Diana Carroll


It would be great if there were some guidance about what the example code 
shipped with Spark (under $SPARKHOME/examples and $SPARKHOME/python/examples) 
does and how to run it.  Perhaps a comment block at the beginning explaining 
what the code accomplishes and what parameters it takes.  Also, if there are 
sample datasets on which the example is designed to run, please point to those.

(As an example, look at kmeans.py, which takes a file argument but gives no hint 
about what sort of data should be in the file or what format it should take.)
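
For instance, a module docstring along these lines (the usage and data format 
shown here are illustrative, not the example's actual ones) would go a long way:

{code}
"""
kmeans.py: naive k-means clustering example.

Usage: kmeans.py <file> <k> <convergeDist>

<file> should contain one point per line as whitespace-separated
floating-point coordinates, e.g. "1.0 2.0 3.0".
"""
{code}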



--
This message was sent by Atlassian JIRA
(v6.2#6252)