Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex, but cache the data before you run zipWithIndex... that way
your ordering will be consistent (unless the bug has been fixed so that you
no longer need to cache the data)...

Normally these operations are used for dictionary building, so I am
hoping you can cache the RDD[String] dictionary before you run
zipWithIndex...

The indices range from 0 to maxIndex-1... if you want them to start at 1,
you have to map each index to index + 1 afterwards.
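
Something like this in Scala, roughly (a minimal sketch; the SparkContext
sc and the input values are assumed):

    val words = sc.parallelize(Seq("word1", "word2", "word3"))

    // cache first, so the ordering stays consistent if the
    // partitions ever get recomputed
    val cached = words.cache()

    // zipWithIndex returns (element, index) pairs with 0-based indices;
    // shift to 1-based and swap so the index becomes the key
    val keyed = cached.zipWithIndex().map { case (word, i) => (i + 1, word) }

    // keyed.collect() => Array((1,word1), (2,word2), (3,word3))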

On Tue, Nov 18, 2014 at 8:56 AM, Blind Faith person.of.b...@gmail.com
wrote:

 As it is difficult to explain this, I will show what I want. Let us say
 I have an RDD A with the following value

 A = [word1, word2, word3]

 I want to have an RDD with the following value

 B = [(1, word1), (2, word2), (3, word3)]

 That is, it gives each entry a unique number as its key. Can we do
 such a thing with Python or Scala?



Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote:
 Use zipWithIndex, but cache the data before you run zipWithIndex... that way
 your ordering will be consistent (unless the bug has been fixed so that you
 no longer need to cache the data)...

Could you point me to a link about the bug?

 Normally these operations are used for dictionary building, so I am
 hoping you can cache the RDD[String] dictionary before you run
 zipWithIndex...

 The indices range from 0 to maxIndex-1... if you want them to start at 1,
 you have to map each index to index + 1 afterwards.

 On Tue, Nov 18, 2014 at 8:56 AM, Blind Faith person.of.b...@gmail.com
 wrote:

 As it is difficult to explain this, I will show what I want. Let us say
 I have an RDD A with the following value

 A = [word1, word2, word3]

 I want to have an RDD with the following value

 B = [(1, word1), (2, word2), (3, word3)]

 That is, it gives each entry a unique number as its key. Can we do
 such a thing with Python or Scala?



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Sean Owen
On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
 On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 Use zipWithIndex, but cache the data before you run zipWithIndex... that way
 your ordering will be consistent (unless the bug has been fixed so that you
 no longer need to cache the data)...

 Could you point me to a link about the bug?

I think it's this:

https://issues.apache.org/jira/browse/SPARK-3098

... but it's resolved as not a bug.
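
For what it's worth, as I understand that ticket: without caching, the RDD
can be recomputed between actions, and if the lineage contains a
non-deterministic step, elements can come back in a different order each
time, so zipWithIndex can hand out different indices on each recomputation.
A rough illustration (repartition here is just one example of a
non-deterministic step):

    // the fetch order after a shuffle is not deterministic, so the element
    // order coming out of repartition can differ across recomputations
    val shuffled = sc.parallelize(1 to 1000).repartition(4)
    val indexed = shuffled.zipWithIndex()

    // each action may recompute the lineage, so the same element can
    // receive a different index across these two calls...
    val first = indexed.collect()
    val second = indexed.collect()

    // ...unless the computed data is pinned first:
    val stable = shuffled.cache().zipWithIndex()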

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
I see, thanks!

On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote:
 On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
 On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 Use zipWithIndex, but cache the data before you run zipWithIndex... that way
 your ordering will be consistent (unless the bug has been fixed so that you
 no longer need to cache the data)...

 Could you point me to a link about the bug?

 I think it's this:

 https://issues.apache.org/jira/browse/SPARK-3098

 ... but it's resolved as not a bug.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org