Shall we document this in the API doc?

Best,
-- Nan Zhu

On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
> zipWithUniqueId is also affected...
>
> I had to persist the dictionaries to make use of the indices lower down in
> the flow...
>
> On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <so...@cloudera.com> wrote:
> > Reference - https://issues.apache.org/jira/browse/SPARK-3098
> > I imagine zipWithUniqueId is also affected, but may not happen to have
> > exhibited in your test.
> >
> > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > Some more debugging revealed that, as Sean said, I have to keep the
> > > dictionaries persisted till I am done with the RDD manipulation...
> > >
> > > Thanks Sean for the pointer... would it be possible to point me to the
> > > JIRA as well?
> > >
> > > Are there plans to make it more transparent for the users?
> > >
> > > Is it possible for the DAG to speculate such things... similar to branch
> > > prediction ideas from comp arch...
> > >
> > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > >>
> > >> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
> > >>
> > >> What's the difference between zipWithIndex vs zipWithUniqueId?
> > >>
> > >> For zipWithIndex we don't need to run the count to compute the offset
> > >> which is needed for zipWithUniqueId and so zipWithIndex is efficient?
> > >> It's not very clear from docs...
> > >>
> > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > >>>
> > >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> > >>> order...
> > >>>
> > >>> There is also zipWithUniqueId... I am trying that... If that also shows
> > >>> the same issue, we should make it clear in the docs...
> > >>>
> > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
> > >>>>
> > >>>> From offline question - zipWithIndex is being used to assign IDs. From a
> > >>>> recent JIRA discussion I understand this is not deterministic within a
> > >>>> partition so the index can be different when the RDD is reevaluated.
> > >>>> If you need it fixed, persist the zipped RDD on disk or in memory.
> > >>>>
> > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> > >>>>> dictionary is built and cached, I find key "almonds" at value 5187
> > >>>>> using:
> > >>>>>
> > >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> > >>>>>
> > >>>>> Output:
> > >>>>>
> > >>>>> Debug product almonds index 5187
> > >>>>>
> > >>>>> Now I take the same dictionary and write it out as:
> > >>>>>
> > >>>>> dictionary.map{case(product, index) => product + "," + index}
> > >>>>>   .saveAsTextFile(outputPath)
> > >>>>>
> > >>>>> Inside the map I also print what's the product at index 5187 and I get
> > >>>>> a different product:
> > >>>>>
> > >>>>> Debug Index 5187 userOrProduct cardigans
> > >>>>>
> > >>>>> Is this an expected behavior from map?
> > >>>>>
> > >>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
> > >>>>> index...
> > >>>>>
> > >>>>> I am using spark-1.1 but it's a snapshot..
> > >>>>>
> > >>>>> Thanks.
> > >>>>> Deb
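
If this goes into the API doc, a small illustration of why the ids move might help. Below is a minimal sketch in plain Scala (no Spark dependency) that models how the two methods assign ids, following the description in the RDD scaladoc: zipWithIndex orders by partition and then by position within the partition, and zipWithUniqueId gives item i of partition k the id i * n + k for n partitions. The object name `ZipIndexSketch`, the `Partitions` alias, the `idOf` helper, the "socks" entry and the partition layout are all hypothetical and only there for illustration; this is a model of the behaviour, not Spark's implementation.

```scala
// Plain-Scala model of the id assignment; a dataset is a Seq of partitions.
object ZipIndexSketch {
  type Partitions[A] = Seq[Seq[A]]

  // Models RDD.zipWithIndex: ids follow partition order, then item order
  // inside each partition. The start offsets need every partition's size,
  // which is why Spark has to count the partitions first.
  def zipWithIndex[A](parts: Partitions[A]): Partitions[(A, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).map { case (part, start) =>
      part.zipWithIndex.map { case (item, i) => (item, start + i) }
    }
  }

  // Models RDD.zipWithUniqueId: item i of partition k gets id i * n + k,
  // where n is the number of partitions. No counting is needed.
  def zipWithUniqueId[A](parts: Partitions[A]): Partitions[(A, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }

  // Looks up the id assigned to one key, if the key is present.
  def idOf[A](zipped: Partitions[(A, Long)], key: A): Option[Long] =
    zipped.flatten.find(_._1 == key).map(_._2)

  def main(args: Array[String]): Unit = {
    // Hypothetical data: the same items, but the second evaluation yields
    // partition 0 in a different order (assumed here, e.g. after a shuffle).
    val firstRun  = Seq(Seq("almonds", "cardigans"), Seq("socks"))
    val secondRun = Seq(Seq("cardigans", "almonds"), Seq("socks"))

    println(idOf(zipWithIndex(firstRun), "almonds"))     // Some(0)
    println(idOf(zipWithIndex(secondRun), "almonds"))    // Some(1)
    println(idOf(zipWithUniqueId(firstRun), "almonds"))  // Some(0)
    println(idOf(zipWithUniqueId(secondRun), "almonds")) // Some(2)
  }
}
```

Both methods depend on the order of items within a partition, so under this model neither one is stable if that order changes between evaluations. That matches Sean's advice above: persist the zipped RDD (for example with persist() or cache()) before reusing the ids, so that later actions read the stored result instead of recomputing it.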