[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-08-02 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
Hey all - sorry I haven't been able to focus on this. It shouldn't be tough to do, but it will need some tests. If we can find someone who wants to take it over, I think it makes a decent starter task :)


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-07-01 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
@MLnick - or, if you don't have a chance, would it be OK for us to find someone (perhaps someone new to the project) to take this over and bring it to the finish line?


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-06-20 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/14579
  
@MLnick Hi, are you still working on this? If so, could you fix conflicts 
and update please?


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-05-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14579
  
gentle ping.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-02-24 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
Do you have time to update this, @MLnick, or would it be OK if someone else made an updated PR based on this? It would be a nice feature to have for 2.2 :)


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14579
  
Is there any reason why this hasn't been merged yet? I personally like this too.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-11-26 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
Just a gentle ping - it would be cool to add this for 2.1 if we have the time :)


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-11-02 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
I hope to get to this soon - it's just the test cases that I still need to get to!


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-11-01 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
Just pinging to see how it's going?


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-25 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Looks good to me. 👍


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-25 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
I like it personally - if no one has a good reason why not it seems like a 
very reasonable approach.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-25 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
@nchammas @holdenk @davies @rxin how about the approach of @MechCoder in 
https://github.com/apache/spark/pull/14579#discussion_r74813935?

I think this will work well, since we could raise an error to prevent (almost all, I think) usages outside of the intended pattern of `with some_rdd.cache() as x:` or `with some_rdd_already_cached as x:`.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
@nchammas to answer your question above 
(https://github.com/apache/spark/pull/14579#issuecomment-239185438) - in short 
no.

The semantics of the utility method will be the same as for the `closing` 
example. `persisted` will return a wrapper class that implements the context 
manager methods. When entering the `with` statement, the `__enter__` method is 
called, which returns the underlying `rdd` instance - this is in turn bound to 
the variable following `as`. So it would look something like this:

```python
class persisted(object):
    """Wraps an RDD/DataFrame; usable only as a context manager."""

    def __init__(self, thing):
        self.thing = thing

    def __enter__(self):
        # the wrapped instance is what gets bound to the variable after `as`
        return self.thing

    def __exit__(self, *exc_info):
        self.thing.unpersist()
```

If someone tries to do `persisted(rdd).map(...)` it will throw an 
`AttributeError`.

The "file-like" version is what is currently implemented in this PR, and it 
works if `__enter__` returns `self` which is what `file` et al do. Of course 
those classes seem to not tend to have other methods that return `self` so 
don't suffer the same chaining issue we run into with `RDD`.
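
To make the intended usage concrete, a rough sketch (assuming a pyspark shell where `sc` is defined and the `persisted` class above exists):

```python
rdd = sc.parallelize(["b", "a", "c"])

# intended pattern: wrap an already-persisted RDD and use it only in a `with` block
with persisted(rdd.cache()) as r:
    r.count()            # r is the underlying RDD returned by __enter__
# __exit__ has now called rdd.unpersist()

# chaining on the wrapper itself fails fast, because the wrapper is not an RDD
persisted(rdd).map(lambda x: x)    # AttributeError
```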


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Thanks for the quick overview. That's pretty straightforward, actually! 
I'll take a look at `PipelinedRDD` for the details. 👍


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
Sure - at a bit of a high level (and not exactly on point): copying data between Python and the JVM is kind of slow, so if we have multiple Python operations that can be put together in the same task, we try to do that. Since caching is handled by storing the data inside the JVM, a cached RDD can't be pipelined, because we need to copy the result over to the JVM anyway. You can see the details in `PipelinedRDD` in rdd.py.
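
To illustrate pipelining from the Python side (just a sketch in a pyspark shell; the class names are internal details of rdd.py):

```python
rdd = sc.parallelize(range(10))

# two Python-side transformations get fused into a single PipelinedRDD,
# so each partition makes one round trip through the Python worker
fused = rdd.map(lambda x: x * 2).filter(lambda x: x % 4 == 0)
print(type(fused))    # <class 'pyspark.rdd.PipelinedRDD'>

# once an RDD is persisted, its data lives in the JVM, so transformations
# applied after the persist can no longer be fused with the ones before it
cached = rdd.map(lambda x: x * 2).cache()
cached.count()
```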


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for, 
so the examples are going a bit over my head. 😅)


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
@nchammas to be clear - subclassing only breaks pipelining if the persisted_rdd is later unpersisted (e.g. via a `with` statement or otherwise) - otherwise you can't pipeline on top of a persisted RDD anyway (which is why I say it's a corner case).


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
> So there is no chaining requirement, and it will only work in a with 
statement.

@MLnick - Couldn't we also create a scenario (like @holdenk did earlier) 
where a user does something like this?

```python
persisted_rdd = persisted(rdd)
persisted_rdd.map(...).filter(...).count()
```

This would break pipelining too, no?

And I think the expectation would be for it not to break pipelining, 
because existing common context managers in Python don't have a requirement 
that they _must_ be used in a `with` block.

For example, `f = open(file)` works fine, as does `s = requests.Session()`, 
and the resulting objects have the same behavior as they would inside a `with` 
block.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
After looking at it and considering all the above, I would say the options are (1) do nothing; or (2) if we want to support this use case, implement a single utility method (I would call it `persisted`) that is a context manager.

Even though we could _almost_ achieve things with subclassing, it does sort of break something and adds too much risk/complexity relative to the reward of the feature, IMHO.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-11 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
@nchammas a utility method (e.g. `cached`) - actually it will be a context 
manager class implemented in the same way as `closing` - will work because it 
only needs to return the context manager, not the RDD instance. So there is no 
chaining requirement, and it will only work in a `with` statement.
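
For reference, a minimal sketch of such a helper - written with `contextlib.contextmanager` here, though a small class in the style of `closing` would work just as well (the name `cached` is just the working name):

```python
from contextlib import contextmanager

@contextmanager
def cached(rdd):
    # persist on entry, hand the RDD to the `with` block, unpersist on exit
    rdd.cache()
    try:
        yield rdd
    finally:
        rdd.unpersist()
```

Used as `with cached(rdd) as r: r.count()` - and since `cached(rdd)` is a generator-based context manager rather than an RDD, it really is only usable in a `with` statement.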


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
Right, I wouldn't expect it to error with subclassing - just not pipeline successfully - but only in a very long-shot corner case.

I think the try/finally with persistence is not an uncommon pattern (we have something similar happen frequently inside of Spark ML/MLlib, but it's in Scala code).
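
For the record, the plain try/finally version of that pattern in PySpark looks something like this (which is exactly what the context manager would encapsulate):

```python
rdd = sc.parallelize(range(100)).map(lambda x: x * x)
rdd.cache()
try:
    total = rdd.count()
    head = rdd.take(5)
finally:
    rdd.unpersist()
```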


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is 
used so I'll have to defer to y'all on this. Does the `cached()` utility method 
have this same problem?

> We could possibly work around it with some type checking etc but it then 
starts to feel like adding more complexity than the feature is worth...

Agreed.

At this point, actually, I'm beginning to feel this feature is not worth it.

Context managers seem to work best when the objects they're working on have 
clear open/close-style semantics. File handles, network connections, and the 
like fit this pattern well.

In fact, the [doc for 
`with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement)
 says:

> This allows common `try...except...finally` usage patterns to be 
encapsulated for convenient reuse.

RDDs and DataFrames, on the other hand, don't have a simple open/close or 
`try...except...finally` pattern. And when we try to map one onto persist and 
unpersist, we get the various side-effects we've been discussing here.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
Yeah, it would break pipelining - I don't think it will necessarily throw an error, though.

e.g.

```
In [22]: rdd = sc.parallelize(["b", "a", "c"])

In [23]: type(rdd)
Out[23]: pyspark.rdd.RDD

In [24]: mapped = rdd.map(lambda x: x)

In [25]: type(mapped)
Out[25]: pyspark.rdd.PipelinedRDD

In [26]: mapped._is_pipelinable()
Out[26]: True

In [27]: p = mapped.cache()

In [28]: type(p)
Out[28]: pyspark.rdd.PersistedRDD

In [29]: p._is_pipelinable()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-...> in <module>()
----> 1 p._is_pipelinable()

AttributeError: 'PersistedRDD' object has no attribute '_is_pipelinable'

In [30]: mapped2 = p.map(lambda x: x)

In [31]: type(mapped2)
Out[31]: pyspark.rdd.PipelinedRDD
```

So I think chaining will work, but the pipelined RDD thinks mapped2 is the 
1st transformation, while it is actually the 2nd. I _think_ this will just be 
an efficiency issue rather than a correctness issue however. 

We could possibly work around it with some type checking etc but it then 
starts to feel like adding more complexity than the feature is worth...


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Sorry, you're right, `__exit__()`'s return value is not going to be 
consumed anywhere. What I meant is that `unpersist()` would return the base RDD 
or DataFrame object.

But I'm not seeing the issue with the example you posted. Reformatting for 
clarity:

```python
magic = rdd.persist()

with magic as awesome:
    awesome.count()

magic.map(lambda x: x + 1)
```

Are you saying `magic.map()` will error? Why would it?

`magic` would be an instance of `PersistedRDD`, which in turn is a subclass 
of `RDD`, which has `map()` and all of the usual methods defined, plus the 
magic methods we need for the context manager.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
@nchammas so if we go with the subclassing approach but keep the current cache/persist interface (i.e. no special utility function), a user could easily write something like:

```python
magic = rdd.persist()
with magic as awesome:
    awesome.count()
magic.map(lambda x: x + 1)
```

I don't _believe_ `__exit__()` could easily return a result that would update `magic` to be `rdd` (in fact, `__exit__()` generally doesn't return a result - instead it's expected to do the teardown logic internally).


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
> the subclassing of RDD approach could cause us to miss out on pipelining 
if the RDD was used again after it was unpersisted

How so? Wouldn't `__exit__()` simply return the parent RDD or DataFrame 
object?


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/14579
  
One minor thing to keep in mind - the subclassing-of-RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted - but I think that is a relatively minor issue.

On the whole I think modifying the base RDD and DataFrame classes (option A / option 4), which is what @MLnick has implemented here, is probably one of the more reasonable options - the `with` statement doesn't add anything if the RDD/DataFrame isn't persisted, but it can do cleanup if it is.

But if there is a better way to do this I'd be excited to find out as well :)


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14579
  
cc @davies


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
None of our options seems great, but if I had to rank them I would say:

1. Add new `Persisted...` classes.
2. Make no changes.
3. Add separate `persisted()` or `cached()` utility method.
4. Modify base RDD and DataFrame classes.

Adding new internal classes for this use-case honestly seems a bit 
heavy-handed to me, so if we are against that then I would lean towards not 
doing anything.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Ah, you're right.

So if we want to avoid adding magic methods to the main RDD/DataFrame classes and avoid needing a separate utility method like `cached()`, I think one option available to us is to have separate `PersistedRDD` and `PersistedDataFrame` classes that simply wrap the base RDD and DataFrame classes and add the appropriate magic methods.

`.persist()` and `.cache()` would then return instances of these classes, 
which should satisfy the `type(x).__enter__(x)` behavior while still preserving 
backwards compatibility and method chaining.
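
Roughly, for the RDD side, I'm imagining something like this - just a sketch, and it glosses over the fact that `persist()`/`cache()` would also need to construct and return the new class:

```python
from pyspark.rdd import RDD

class PersistedRDD(RDD):
    """Hypothetical subclass: behaves exactly like an RDD, plus context-manager support."""

    def __enter__(self):
        # `with some_persisted_rdd as r:` binds r to the (persisted) RDD itself,
        # so chaining like r.map(...).count() keeps working inside the block
        return self

    def __exit__(self, *exc_info):
        # leaving the block releases the cached data
        self.unpersist()
        return False  # don't suppress exceptions
```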

What do you think of that?


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
@nchammas I looked at the `@contextmanager` decorator. It is an easy way to create a method that returns a context manager, but it is essentially _only_ usable in a `with` statement, as it returns a `contextlib.GeneratorContextManager`. For this use case it does not solve the issue that we need to return the RDD/DF instance from `cache`.

As far as I can see, the `closing` helper function is just a [context manager itself](https://github.com/python/cpython/blob/master/Lib/contextlib.py#L163), i.e. the same as the final variant of option (a) mentioned above: `with cached(rdd) as ...`


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread nchammas
Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579
  
Thanks @MLnick for taking this on and for breaking down what you've found 
so far.

I took a look through 
[`contextlib`](https://docs.python.org/3/library/contextlib.html) for 
inspiration, and I wonder if the source code for 
[`closing()`](https://docs.python.org/3/library/contextlib.html#contextlib.closing)
 offers a template we can follow that would let `persist()` return an 
RDD/DataFrame instance with the correct magic methods, without having to modify 
the class.

Have you taken a look at that?


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14579
  
cc @nchammas @holdenk @rxin 

**Note:**
This is implemented by adding the `__enter__` and `__exit__` methods to RDD/DataFrame directly. This allows some potentially weird stuff: any instance of RDD/DF (including the result of any method that returns `self`) can be used in a `with` statement, e.g. this works:

```python
with rdd.map(lambda x: x) as x:
    ...
```

Clearly this doesn't make a lot of sense. However, I looked at two options: (a) a separate context-manager wrapper class returned by `persist`; and (b) trying to dynamically add the methods (or at least `__enter__`) in `persist`.

The problem with (a) is that `persist` needs to return an RDD/DF instance, 
so this breaks chaining behavior such as `rdd.cache().count()` etc.

The problem with (b) is that the special method `__enter__` is called in the context of `with` as `type(rdd).__enter__(rdd)` (see [PEP 343](https://www.python.org/dev/peps/pep-0343/)). So it does not help to add the method dynamically to an instance; it must be added to the class. In that case, after the first usage in a `with` statement, *all* existing and future instances of RDD/DF have the `__enter__` method, putting us in the same situation as the approach in this PR of having the methods defined on the class (with the associated allowed "weirdness").
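
A quick demonstration of that lookup rule, with a plain object standing in for the RDD (hypothetical names, plain Python):

```python
class Thing(object):
    pass

t = Thing()
t.__enter__ = lambda: t             # added to the *instance* only
t.__exit__ = lambda *exc: False

try:
    with t:                         # `with` looks up type(t).__enter__, so this fails
        pass
except (AttributeError, TypeError) as e:
    print("instance-level __enter__ is ignored:", e)

# adding the methods to the *class* works - but then every Thing instance,
# existing and future, supports `with`, which is the situation described above
Thing.__enter__ = lambda self: self
Thing.__exit__ = lambda self, *exc: False
with t:
    pass
```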

So, if we want to avoid that, the only option I see is a variant of (a) 
above - adding a `cached`/`persisted` method that returns a context manager, so 
it would look like this:

```python
with cached(rdd) as x:
    x.count()
```

This is less "elegant" but more explicit.

If anyone has other smart ideas for handling option (b) above, please do shout!


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14579
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63520/
Test PASSed.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14579
  
Merged build finished. Test PASSed.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14579
  
**[Test build #63520 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)**
 for PR 14579 at commit 
[`2b4e56e`](https://github.com/apache/spark/commit/2b4e56e72bf3cd291349baf6feb197666d368b67).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2016-08-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14579
  
**[Test build #63520 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)**
 for PR 14579 at commit 
[`2b4e56e`](https://github.com/apache/spark/commit/2b4e56e72bf3cd291349baf6feb197666d368b67).

