[GitHub] spark issue #15053: [SPARK-18069][Doc] improve python API docstrings

2017-01-28 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon my apologies for not having been able to follow up on this. I 
still think this doc improvement would be very helpful to pyspark users. Would 
you like to take over the PR?

Thanks





[GitHub] spark issue #15053: [SPARK-18069][Doc] improve python API docstrings

2016-10-24 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon I've created a JIRA ticket and also went through all the files 
you mentioned, please take a look





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-29 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon do you know how I could run the doctests for these files? I 
found this online: 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals which says 
that I could do 

```
SPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py
```
but that doesn't actually work
```
$ SPARK_TESTING=1 ./bin/pyspark python/pyspark/sql/dataframe.py
python: Error while finding spec for 'python/pyspark/sql/dataframe.py' (ImportError: No module named 'python/pyspark/sql/dataframe')
```
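For what it's worth, here is a minimal sketch of how the doctests in `dataframe.py` could be exercised by hand with a live `SparkSession`. This assumes `pyspark` is importable in the current environment (e.g. `$SPARK_HOME/python` on `PYTHONPATH`); it is not the official test entry point (the project's `./python/run-tests` script is presumably the supported route):
```
import doctest

from pyspark.sql import SparkSession
import pyspark.sql.dataframe as mod

# Start a small local session and expose it under the name the examples use.
spark = SparkSession.builder.master("local[2]").appName("doctests").getOrCreate()

globs = mod.__dict__.copy()
globs["spark"] = spark
# Note: many examples also expect prebuilt inputs such as `df` and `df2` in
# these globals, which is exactly what this PR is discussing.

results = doctest.testmod(mod, globs=globs, optionflags=doctest.ELLIPSIS)
print(results)

spark.stop()
```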

in terms of the extra time to create these small DataFrames, I think we'd be looking at something negligible (a few seconds at most)
```
In [2]: %timeit df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
100 loops, best of 3: 10.3 ms per loop
```

i.e. this should be on the order of 1 second even if `df` is created 100 times.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
In other words, currently it is not possible for the user to follow the examples in the docstring below. It's not clear what all these input variables (`df`, `df2`, etc.) are, or where you'd even find them:
```
In [5]: DataFrame.join?
Signature: DataFrame.join(self, other, on=None, how=None)
Docstring:
Joins with another :class:`DataFrame`, using the given join expression.

:param other: Right side of the join
:param on: a string for the join column name, a list of column names,
    a join expression (Column), or a list of Columns.
    If `on` is a string or a list of strings indicating the name of the join column(s),
    the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default 'inner'.
    One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.

The following performs a full outer join between ``df1`` and ``df2``.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
[Row(name='Tom', height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]

>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name='Bob', height=85)]

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name='Bob', age=5)]

.. versionadded:: 1.3
File:  ~/code/spark/python/pyspark/sql/dataframe.py
Type:  function
```
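For illustration, input DataFrames along the following lines would make the `join` examples above reproducible; the exact contents of `df2`, `df3`, and `df4` here are my assumptions and not necessarily the definitions Spark's own test harness uses:
```
# Hypothetical inputs for the join examples quoted above (values assumed).
df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
df2 = spark.createDataFrame([('Tom', 80), ('Bob', 85)], ['name', 'height'])
df3 = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
df4 = spark.createDataFrame([('Alice', 10, 80), ('Bob', 5, None)], ['name', 'age', 'height'])

df.join(df2, 'name').select(df.name, df2.height).collect()
# [Row(name='Bob', height=85)]
```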





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
But that's what this PR is supposed to fix: the problem that the docstring for each individual method is not self-contained :)

I think I now see where I was confused - it seems like we are assuming the 
user would be looking at the package level docstring? I don't think that's the 
typical workflow.

I think the user would be looking at the docstring of one method and expect 
the docstring to explain how the method works. (hence the example with `numpy` 
I posted above 
https://github.com/apache/spark/pull/15053#issuecomment-247906649) 

For instance in `ipython` if you do `DataFrame.join?` it would bring up the docstring for the method `join()`, and it just seems really odd that it'd have everything, including the function signature and parameters, an explanation of how it works, and example usage ... everything except how to construct the very input data you need to interact with the example.

I don't think the user would know that the input DataFrame in the example 
is somehow defined in the package level docstring.









[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
I see, ok so you mean leave all the docstrings for the individual methods 
unchanged, but instead just add 
```
"""
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
"""
```

at the top of the file?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon I may still be confused about something - first of all what do 
you mean by the package level docstring? Do you mean here: 
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1559
 or here: 
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L18?

Also, is the idea that we would define `df` globally, and then for the 
docstring of each function we would *not* do:

```
>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]
```

and instead we do:
```
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]
```

therefore not showing the user how to construct the DataFrame? 





[GitHub] spark pull request #15053: [Doc] improve python API docstrings

2016-09-25 Thread mortada
Github user mortada commented on a diff in the pull request:

https://github.com/apache/spark/pull/15053#discussion_r80405104
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -411,7 +415,7 @@ def monotonically_increasing_id():
 
 The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
 The current implementation puts the partition ID in the upper 31 bits, and the record number
-within each partition in the lower 33 bits. The assumption is that the data frame has
+within each partition in the lower 33 bits. The assumption is that the DataFrame has
--- End diff --

@HyukjinKwon great idea, will update 
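For readers of the thread, a minimal illustrative use of the function whose docstring is being edited here; the input DataFrame is an assumption made up for the example:
```
from pyspark.sql.functions import monotonically_increasing_id

# Attach a monotonically increasing (but not consecutive) 64-bit id to each
# row; per the docstring above, the upper 31 bits carry the partition id.
df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
df.withColumn('id', monotonically_increasing_id()).show()
```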





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon I understand we can have `py.test` and `doctest`, but I don't quite see how we could define the input DataFrame globally while still keeping a clear, self-contained docstring for each function?

@holdenk could you please elaborate on what you mean? 

If we want to repeat something like this in every docstring
```
>>> print(df.collect())
```
we might as well simply include how to actually create the DataFrame so the 
user can easily reproduce the example?

It seems to me that the user would often want to see the docstring to 
understand how a function works, and they may not be looking at some global 
documentation as a whole. And the fact that many of the input DataFrames are 
the same is really just a convenience for the doc writer and not a requirement.

For instance this is the docstring for a numpy method (`numpy.argmax`), where the example's input is clearly defined:
```
Examples
--------
>>> a = np.arange(6).reshape(2,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])
```

IMHO it seems odd to require the user to look at some global doc in order to follow the example usage for a single function.
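A pyspark docstring written in that same self-contained style might look like the following; this is an illustrative sketch, not the current Spark text:
```
>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
>>> df.select(df.name).collect()
[Row(name='Alice'), Row(name='Bob')]
```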





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon thanks for your help! I'm happy to complete this PR and follow 
what you suggest for testing. 

How would the package level docstring work? The goal (which I think we all 
agree on) is to be able to let the user easily see how the input is set up for 
each function in the docstring in a self-contained way. 





[GitHub] spark pull request #15053: [Doc] improve python API docstrings

2016-09-12 Thread mortada
Github user mortada commented on a diff in the pull request:

https://github.com/apache/spark/pull/15053#discussion_r78480019
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1420,7 @@ def withColumnRenamed(self, existing, new):
 :param existing: string, name of the existing column to rename.
 :param col: string, new name of the column.
 
+>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
 >>> df.withColumnRenamed('age', 'age2').collect()
 [Row(age2=2, name=u'Alice'), Row(age2=5, name=u'Bob')]
--- End diff --

@HyukjinKwon thank you, I'll update the PR 





[GitHub] spark pull request #15053: [Doc] improve python API docstrings

2016-09-11 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/15053

[Doc] improve python API docstrings

## What changes were proposed in this pull request?

A lot of the Python API functions show example usage that is incomplete: the docstring shows output without the input DataFrame being defined, which makes the examples quite confusing to follow. This PR fixes those docstrings.

## How was this patch tested?

docs changes only




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mortada/spark python_docstring

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15053


commit 52240bcf8df42dd454e874ce7640d7040c5cdad9
Author: Mortada Mehyar <mortada.meh...@gmail.com>
Date:   2016-09-11T20:28:54Z

[Doc] improve python API docstrings







[GitHub] spark issue #12229: [SPARK-10063][SQL] Remove DirectParquetOutputCommitter

2016-08-22 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/12229
  
@rxin so it seems like `DirectParquetOutputCommitter` has been removed in Spark 2.0; is there a recommended replacement?

(I'm in the process of migrating from Spark 1.6 to 2.0)





[GitHub] spark pull request #14253: [Doc] improve python doc for rdd.histogram

2016-07-18 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/14253

[Doc] improve python doc for rdd.histogram

## What changes were proposed in this pull request?

doc change only


## How was this patch tested?

doc change only 
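For context, a small illustrative use of `rdd.histogram`, the method whose doc this PR touches; this assumes a pyspark shell where `sc` is available, and the input values are made up:
```
# Illustrative rdd.histogram call; the input values are made up.
rdd = sc.parallelize([1, 2, 3, 4, 5, 25])

# With explicit bucket boundaries, histogram() returns (buckets, counts);
# the last bucket is inclusive of its upper bound.
buckets, counts = rdd.histogram([0, 10, 20, 30])
print(buckets)  # [0, 10, 20, 30]
print(counts)   # [5, 0, 1]
```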




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mortada/spark histogram_typos

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14253


commit 979c7f44690c5239f49621733de112ec623e
Author: Mortada Mehyar <mortada.meh...@gmail.com>
Date:   2016-07-19T05:22:58Z

[Doc] improve python doc for rdd.histogram







[GitHub] spark issue #13639: [DOCUMENTATION] fixed typos in python programming guide

2016-06-13 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/13639
  
@srowen I went through the docs and found a few more minor fixes





[GitHub] spark issue #13639: [DOCUMENTATION] fixed typo

2016-06-13 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/13639
  
sure will do 





[GitHub] spark pull request #13639: [DOCUMENTATION] fixed typo

2016-06-13 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/13639

[DOCUMENTATION] fixed typo

## What changes were proposed in this pull request?

minor typo


## How was this patch tested?

minor typo in the doc, should be self-explanatory




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mortada/spark typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13639


commit ca7c11e4d33843c10a67e234bd6a9057a645b317
Author: Mortada Mehyar <mortada.meh...@gmail.com>
Date:   2016-06-13T07:20:05Z

[DOCUMENTATION] fixed typo







[GitHub] spark pull request #13587: [Documentation] fixed groupby aggregation example...

2016-06-09 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/13587

[Documentation] fixed groupby aggregation example for pyspark

## What changes were proposed in this pull request?

fixing documentation for the groupby/agg example in python

## How was this patch tested?

the existing example in the documentation does not contain valid syntax (a missing parenthesis) and does not use `Column` in the expression for `agg()`

after the fix here's how I tested it:

```
In [1]: from pyspark.sql import Row

In [2]: import pyspark.sql.functions as func

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
: {'age': 20, 'department': 1, 'expense': 200},
: {'age': 21, 'department': 2, 'expense': 300},
: {'age': 22, 'department': 2, 'expense': 300},
: {'age': 23, 'department': 3, 'expense': 300}]
:--

In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])

In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()

+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mortada/spark groupby_agg_doc_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13587.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13587


commit 783415c9424cb6db1333aa5bc3ccd3cd1b227204
Author: Mortada Mehyar <mortada.meh...@gmail.com>
Date:   2016-06-10T02:34:16Z

[Documentation] fixed groupby aggregation example for pyspark







[GitHub] spark pull request: [SPARK-12760] [DOCS] invalid lambda expression...

2016-01-21 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/10867

[SPARK-12760] [DOCS] invalid lambda expression in python example for local vs cluster

@srowen thanks for the PR at https://github.com/apache/spark/pull/10866! 
sorry it took me a while.

This is related to https://github.com/apache/spark/pull/10866, basically 
the assignment in the lambda expression in the python example is actually 
invalid

```
In [1]: data = [1, 2, 3, 4, 5]
In [2]: counter = 0
In [3]: rdd = sc.parallelize(data)
In [4]: rdd.foreach(lambda x: counter += x)
  File "", line 1
rdd.foreach(lambda x: counter += x)
   ^
SyntaxError: invalid syntax
``` 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mortada/spark doc_python_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10867.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10867


commit fc9f16a2ffb5846ecc03c4df584f611e6728573d
Author: Mortada Mehyar <mortada.meh...@gmail.com>
Date:   2016-01-21T16:51:28Z

[SPARK-12760] [DOCS] invalid lambda expression in python example for local 
vs cluster







[GitHub] spark pull request: [SPARK-12760] [DOCS] invalid lambda expression...

2016-01-21 Thread mortada
Github user mortada commented on the pull request:

https://github.com/apache/spark/pull/10867#issuecomment-173648674
  
@srowen I tested the python code in cluster mode (5 ec2 workers) and this 
works fine

```
16/01/21 17:33:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.10.56:35937 with 6.6 GB RAM, BlockManagerId(4, 172.31.10.56, 35937)
16/01/21 17:33:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.10.55:59871 with 6.6 GB RAM, BlockManagerId(0, 172.31.10.55, 59871)
16/01/21 17:33:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.10.53:39162 with 6.6 GB RAM, BlockManagerId(1, 172.31.10.53, 39162)
16/01/21 17:33:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.10.54:59145 with 6.6 GB RAM, BlockManagerId(2, 172.31.10.54, 59145)
16/01/21 17:33:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.10.57:35000 with 6.6 GB RAM, BlockManagerId(3, 172.31.10.57, 35000)
In [1]: data = [1, 2, 3, 4, 5]

In [2]: counter = 0

In [3]: rdd = sc.parallelize(data)

In [4]: def increment_counter(x):
   ...:     global counter
   ...:     counter += x
   ...:

In [5]: rdd.foreach(increment_counter)
16/01/21 17:34:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.10.55:59871 (size: 3.2 KB, free: 6.6 GB)
16/01/21 17:34:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.10.56:35937 (size: 3.2 KB, free: 6.6 GB)
16/01/21 17:34:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.10.57:35000 (size: 3.2 KB, free: 6.6 GB)
16/01/21 17:34:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.10.53:39162 (size: 3.2 KB, free: 6.6 GB)
16/01/21 17:34:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.10.54:59145 (size: 3.2 KB, free: 6.6 GB)
(other output skipped)

In [6]: print("Counter value: ", counter)
Counter value:  0
```






[GitHub] spark pull request: [SPARK-12760] [DOCS] invalid lambda expression...

2016-01-21 Thread mortada
Github user mortada commented on the pull request:

https://github.com/apache/spark/pull/10867#issuecomment-173639832
  
@srowen it compiles for local, let me test that on a cluster

I noticed that the next line is actually also invalid python

```
In [7]: print("Counter value: " + counter)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 print("Counter value: " + counter)

TypeError: Can't convert 'int' object to str implicitly
```

I just updated the PR 





[GitHub] spark pull request: [SPARK-11837] [EC2] python3 compatibility for ...

2015-11-21 Thread mortada
Github user mortada commented on the pull request:

https://github.com/apache/spark/pull/9797#issuecomment-158695890
  
@JoshRosen Jenkins seemed to have failed again, but this PR should be good 
to go 





[GitHub] spark pull request: [SPARK-11837] [EC2] python3 compatibility for ...

2015-11-19 Thread mortada
Github user mortada commented on a diff in the pull request:

https://github.com/apache/spark/pull/9797#discussion_r45396501
  
--- Diff: ec2/spark_ec2.py ---
@@ -591,11 +591,15 @@ def launch_cluster(conn, opts, cluster_name):
 
 # AWS ignores the AMI-specified block device mapping for M3 (see SPARK-3342).
 if opts.instance_type.startswith('m3.'):
+if sys.version_info[0] >= 3:
+letters = string.ascii_letters
--- End diff --

interesting I didn't realize that - sure I can change this to use 
`string.ascii_letters` without a conditional check. This does mean we'd 
potentially break python2.5 or older though





[GitHub] spark pull request: [SPARK-11837] [EC2] python3 compatibility for ...

2015-11-19 Thread mortada
Github user mortada commented on the pull request:

https://github.com/apache/spark/pull/9797#issuecomment-158145798
  
really puzzled by the test results ... the failed test doesn't seem to have 
anything to do with this PR

```
FAIL: test_update_state_by_key (__main__.BasicOperationTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/streaming/tests.py", line 404, in test_update_state_by_key
    self._test_func(input, func, expected)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/streaming/tests.py", line 160, in _test_func
    self.assertEqual(expected, result)
AssertionError: Lists differ: [[('k[32 chars], [0, 1, 2])], [('k', [0, 1, 2, 3])], [('k', [0, 1, 2, 3, 4])]] != [[('k[32 chars], [0, 1, 2])]]

First list contains 2 additional elements.
First extra element 3:
[('k', [0, 1, 2, 3])]

+ [[('k', [0])], [('k', [0, 1])], [('k', [0, 1, 2])]]
- [[('k', [0])],
-  [('k', [0, 1])],
-  [('k', [0, 1, 2])],
-  [('k', [0, 1, 2, 3])],
-  [('k', [0, 1, 2, 3, 4])]]
```





[GitHub] spark pull request: [SPARK-11837] [EC2] python3 compatibility for ...

2015-11-19 Thread mortada
Github user mortada commented on the pull request:

https://github.com/apache/spark/pull/9797#issuecomment-158270661
  
just updated the PR incorporating your comment, thanks!





[GitHub] spark pull request: python3 compatibility for launching ec2 m3 ins...

2015-11-18 Thread mortada
Github user mortada commented on the pull request:

https://github.com/apache/spark/pull/9797#issuecomment-157912292
  
@JoshRosen sure I just created a JIRA ticket here: 
https://issues.apache.org/jira/browse/SPARK-11837 





[GitHub] spark pull request: python3 compatibility for launching ec2 m3 ins...

2015-11-17 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/9797

python3 compatibility for launching ec2 m3 instances

this currently breaks for python3 because the `string` module doesn't have `letters` anymore; `ascii_letters` should be used instead

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mortada/spark python3_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9797


commit 84dc1230e6ef569935da6b5476bfb1cd6fb31d8e
Author: Mortada Mehyar <mortada.meh...@gmail.com>
Date:   2015-11-18T07:23:54Z

python3 compatibility for launching ec2 m3 instances



