[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1962:
---

Component/s: Spark Core

> Add RDD cache reference counting
> 
>
> Key: SPARK-1962
> URL: https://issues.apache.org/jira/browse/SPARK-1962
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Taeyun Kim
> Priority: Minor
>
> It would be nice if the RDD cache() method incorporated reference counting
> information.
> That is,
> {code}
> void test()
> {
>     JavaRDD<...> rdd = ...;
>     rdd.cache();    // Reference count becomes 1. Actual caching happens.
>     // Reference count becomes 2. No-op as long as the storage level
>     // is the same; otherwise, an exception is thrown.
>     rdd.cache();
>     ...
>     rdd.uncache();  // Reference count becomes 1. No-op.
>     rdd.uncache();  // Reference count becomes 0. Actual unpersist happens.
> }
> {code}
> This can be useful when writing code in a modular way.
> When a function receives an RDD as an argument, it doesn't necessarily know
> the cache status of the RDD.
> It may still want to cache the RDD, since it will use the RDD multiple times.
> But with the current RDD API, it cannot determine whether it should unpersist 
> it or leave it alone (so that the caller can continue to use that RDD without 
> rebuilding).
> For API compatibility, introducing a new method or adding a parameter may be 
> required.
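
A minimal sketch of the counting behavior described above, assuming a hypothetical `RefCountedCache` helper that is not part of Spark. The real persist and unpersist calls are passed in as callbacks and the storage level is modeled as a plain string; both are assumptions made to keep the example self-contained.

```java
import java.util.Objects;
import java.util.function.Consumer;

/**
 * Hypothetical helper, not a Spark API: counts cache requests for one dataset
 * and runs the real persist/unpersist actions only on the 0 -> 1 and 1 -> 0
 * transitions of the reference count.
 */
final class RefCountedCache {
    private final Consumer<String> persistAction;  // receives the storage level name
    private final Runnable unpersistAction;
    private int refCount = 0;
    private String storageLevel = null;            // level chosen by the first holder

    RefCountedCache(Consumer<String> persistAction, Runnable unpersistAction) {
        this.persistAction = Objects.requireNonNull(persistAction);
        this.unpersistAction = Objects.requireNonNull(unpersistAction);
    }

    /** Increments the count; persists only when the count goes from 0 to 1. */
    synchronized void cache(String level) {
        Objects.requireNonNull(level);
        if (refCount == 0) {
            persistAction.accept(level);
            storageLevel = level;
        } else if (!storageLevel.equals(level)) {
            // Mirrors the "Else, exception" rule from the description.
            throw new IllegalStateException(
                "already cached with storage level " + storageLevel);
        }
        refCount++;
    }

    /** Decrements the count; unpersists only when the count reaches 0. */
    synchronized void uncache() {
        if (refCount == 0) {
            throw new IllegalStateException("uncache() without a matching cache()");
        }
        refCount--;
        if (refCount == 0) {
            unpersistAction.run();
            storageLevel = null;
        }
    }

    synchronized int refCount() {
        return refCount;
    }
}
```

With this shape, a second `cache("MEMORY_ONLY")` call only bumps the counter, and the unpersist callback runs once, on the last `uncache()`.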



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Affects Version/s: 1.0.0






[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Description: 
It would be nice if the RDD cache() method incorporated reference counting
information.

That is,

{code}
void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // Reference count becomes 1. Actual caching happens.
    // Reference count becomes 2. No-op as long as the storage level is
    // the same; otherwise, an exception is thrown.
    rdd.cache();

    ...

    rdd.uncache();  // Reference count becomes 1. No-op.
    rdd.uncache();  // Reference count becomes 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an RDD as an argument, it doesn't necessarily know the
cache status of the RDD.
It may still want to cache the RDD, since it will use the RDD multiple times.
But with the current RDD API, it cannot determine whether it should unpersist
it or leave it alone (so that the caller can continue to use that RDD without
rebuilding).

For API compatibility, introducing a new method or adding a parameter may be 
required.
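
One way the "new method" idea could look is sketched below, under stated assumptions: `SharedCache`, `acquire()`, and `Lease` are hypothetical names, not Spark APIs, and the persist and unpersist calls are stand-in callbacks. A function takes a lease while it uses the data and releases it on exit, so it never needs to know whether its caller already cached the data.

```java
/** Hypothetical sketch, not a Spark API: a counted handle around a cacheable dataset. */
final class SharedCache {
    private final Runnable persist;    // stand-in for the real caching call
    private final Runnable unpersist;  // stand-in for the real unpersist call
    private int holders = 0;

    SharedCache(Runnable persist, Runnable unpersist) {
        this.persist = persist;
        this.unpersist = unpersist;
    }

    /** New method alongside the existing API: returns a lease released by close(). */
    synchronized Lease acquire() {
        if (holders == 0) {
            persist.run();             // first holder triggers the real caching
        }
        holders++;
        return new Lease();
    }

    synchronized int holders() {
        return holders;
    }

    final class Lease implements AutoCloseable {
        private boolean released = false;

        @Override
        public void close() {
            synchronized (SharedCache.this) {
                if (released) {
                    return;            // closing twice must not decrement twice
                }
                released = true;
                holders--;
                if (holders == 0) {
                    unpersist.run();   // last holder triggers the real unpersist
                }
            }
        }
    }
}

final class Demo {
    /** A library function that reuses the data without knowing the caller's cache state. */
    static void process(SharedCache data) {
        try (SharedCache.Lease lease = data.acquire()) {
            // ... run several jobs over the data here ...
        }
    }
}
```

If the caller holds its own lease around the call to `Demo.process`, the data stays cached after `process` returns; if it does not, `process` caches and unpersists on its own.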

  was:
It would be nice if the RDD cache() method incorporated reference counting
information.

That is,

{code}
void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // Reference count becomes 1. Actual caching happens.
    // Reference count becomes 2. No-op as long as the storage level is
    // the same; otherwise, an exception is thrown.
    rdd.cache();

    ...

    rdd.uncache();  // Reference count becomes 1. No-op.
    rdd.uncache();  // Reference count becomes 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an RDD as an argument, it doesn't necessarily know the
cache status of the RDD.
It may still want to cache the RDD, since it will use the RDD multiple times.
But with the current RDD API, it cannot determine whether it should unpersist
it or leave it alone (so that the caller can continue to use that RDD without
rebuilding).








[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Description: 
It would be nice if the RDD cache() method incorporated reference counting
information.

That is,

{code}
void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // Reference count becomes 1. Actual caching happens.
    // Reference count becomes 2. No-op as long as the storage level is
    // the same; otherwise, an exception is thrown.
    rdd.cache();

    ...

    rdd.uncache();  // Reference count becomes 1. No-op.
    rdd.uncache();  // Reference count becomes 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an RDD as an argument, it doesn't necessarily know the
cache status of the RDD.
It may still want to cache the RDD, since it will use the RDD multiple times.
But with the current RDD API, it cannot determine whether it should unpersist
it or leave it alone (so that the caller can continue to use that RDD without
rebuilding).


  was:
It would be nice if the RDD cache() method incorporated reference counting
information.

That is,

{code}
void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // Depth becomes 1. Actual caching happens.
    // Depth becomes 2. No-op as long as the storage level is the same;
    // otherwise, an exception is thrown.
    rdd.cache();

    ...

    rdd.uncache();  // Depth becomes 1. No-op.
    rdd.uncache();  // Depth becomes 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an rdd as an argument, it doesn't necessarily know the
cache status of the rdd.
It may still want to cache the rdd, since it will use the rdd multiple times.
But with the current RDD API, it cannot determine whether it should unpersist
it or leave it alone (so that caller can continue to use that rdd without
rebuilding).








[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Description: 
It would be nice if the RDD cache() method incorporated reference counting
information.

That is,

{code}
void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // Depth becomes 1. Actual caching happens.
    // Depth becomes 2. No-op as long as the storage level is the same;
    // otherwise, an exception is thrown.
    rdd.cache();

    ...

    rdd.uncache();  // Depth becomes 1. No-op.
    rdd.uncache();  // Depth becomes 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an rdd as an argument, it doesn't necessarily know the
cache status of the rdd.
It may still want to cache the rdd, since it will use the rdd multiple times.
But with the current RDD API, it cannot determine whether it should unpersist
it or leave it alone (so that caller can continue to use that rdd without
rebuilding).




