Re: Distributed dictionary building

Nan Zhu Tue, 23 Sep 2014 07:56:13 -0700

Shall we document this in the API docs?

Best,


-- 
Nan Zhu


On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:

> zipWithUniqueId is also affected...
> 
> I had to persist the dictionaries to make use of the indices further down
> in the flow...
> 
> On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <so...@cloudera.com> wrote:
> > Reference - https://issues.apache.org/jira/browse/SPARK-3098
> > I imagine zipWithUniqueId is also affected, but the problem may not
> > have shown up in your test.
> > 
> > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > Some more debugging revealed that, as Sean said, I have to keep the
> > > dictionaries persisted until I am done with the RDD manipulation.
> > >
> > > Thanks, Sean, for the pointer. Would it be possible to point me to the
> > > JIRA as well?
> > >
> > > Are there plans to make this more transparent for users?
> > >
> > > Is it possible for the DAG to speculate about such things, similar to
> > > branch prediction ideas from computer architecture?
> > >
> > >
> > >
> > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > >>
> > >> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
> > >>
> > >> What's the difference between zipWithIndex vs zipWithUniqueId ?
> > >>
> > >> For zipWithIndex we don't need to run the count to compute the offset
> > >> which is needed for zipWithUniqueId, and so zipWithIndex is efficient?
> > >> It's not very clear from the docs...
> > >>
> > >>
> > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > >>>
> > >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> > >>> order...
> > >>>
> > >>> There is also zipWithUniqueId...I am trying that...If that also shows
> > >>> the same issue, we should make it clear in the docs...
> > >>>
> > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
> > >>>>
> > >>>> From an offline question: zipWithIndex is being used to assign IDs.
> > >>>> From a recent JIRA discussion I understand this is not deterministic
> > >>>> within a partition, so the index can be different when the RDD is
> > >>>> reevaluated. If you need it fixed, persist the zipped RDD on disk or
> > >>>> in memory.
> > >>>>
> > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> > >>>>> dictionary is built and cached, I find key "almonds" at value 5187 
> > >>>>> using:
> > >>>>>
> > >>>>> rdd.filter { case (product, index) => product == "almonds" }.collect
> > >>>>>
> > >>>>> Output:
> > >>>>>
> > >>>>> Debug product almonds index 5187
> > >>>>>
> > >>>>> Now I take the same dictionary and write it out as:
> > >>>>>
> > >>>>> dictionary.map { case (product, index) => product + "," + index }
> > >>>>>   .saveAsTextFile(outputPath)
> > >>>>>
> > >>>>> Inside the map I also print the product at index 5187, and I get
> > >>>>> a different product:
> > >>>>>
> > >>>>> Debug Index 5187 userOrProduct cardigans
> > >>>>>
> > >>>>> Is this expected behavior from map?
> > >>>>>
> > >>>>> By the way, "almonds" and "apparel-cardigans" are just one apart in
> > >>>>> the index...
> > >>>>>
> > >>>>> I am using spark-1.1, but it's a snapshot.
> > >>>>>
> > >>>>> Thanks.
> > >>>>> Deb
> > >>>>>
> > >>>>>
> > >>>
> > >>
> > >
>
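
The fix suggested in the thread above (persist the zipped RDD so that the indices are not recomputed) can be sketched as follows. This is a minimal sketch, not code from the thread: the object name DictionarySketch, the helper lookups, the sample products, and the local[2] master are assumptions chosen for illustration. Persisting makes recomputation much less likely but does not rule it out, since partitions lost with an executor are rebuilt from lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example: names and sample data are illustrative only.
object DictionarySketch {

  // Builds the dictionary once, persists it, then reads it through two
  // separate actions so that both see the same stored indices.
  def lookups(sc: SparkContext): (Map[String, Long], Map[String, Long]) = {
    val products = sc.parallelize(Seq("almonds", "cardigans", "almonds", "socks"), 2)

    // distinct() involves a shuffle; the order of records inside a partition
    // may differ if the lineage is evaluated again (see SPARK-3098).
    val dictionary = products
      .distinct()
      .zipWithIndex()
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Materialise the persisted RDD before anything else depends on it.
    dictionary.count()

    // First action: collect the dictionary directly.
    val first = dictionary.collect().toMap

    // Second action: a map over the same persisted RDD, as in the
    // save-to-text step described in the thread.
    val second = dictionary
      .map { case (product, index) => (product, index) }
      .collect()
      .toMap

    dictionary.unpersist()
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dictionary-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (first, second) = lookups(sc)
      println(s"same indices in both actions: ${first == second}")
    } finally {
      sc.stop()
    }
  }
}
```

The explicit count() is there only to force the persisted data to be computed once, up front; any action would do.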

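On the zipWithIndex versus zipWithUniqueId question raised in the thread: according to the Spark API docs, zipWithIndex has to trigger a Spark job to learn the partition sizes when the RDD has more than one partition, while zipWithUniqueId does not, and gives the items in the kth partition the ids k, n+k, 2n+k, ... for n partitions. The plain-Scala model below involves no Spark at all; IdSchemes and both function names are hypothetical, and nested Seqs stand in for partitions. It only mimics the two numbering schemes, to illustrate why both depend on the order of items within a partition.

```scala
// Hypothetical model of the two ID schemes; this is not Spark code.
object IdSchemes {

  // zipWithIndex-style numbering: consecutive indices, ordered by partition
  // and then by position. The start offset of each partition is the total
  // size of the partitions before it, which is why sizes must be known first.
  def zipWithIndexModel[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).flatMap { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }

  // zipWithUniqueId-style numbering: the item at position i of partition k
  // gets id i * n + k, where n is the number of partitions. No sizes needed.
  def zipWithUniqueIdModel[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = partitions.size.toLong
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    val original  = Seq(Seq("almonds", "socks"), Seq("cardigans"))
    // Same contents, but the first partition's items arrive in another order.
    val reordered = Seq(Seq("socks", "almonds"), Seq("cardigans"))

    println(zipWithIndexModel(original))
    println(zipWithIndexModel(reordered))
    println(zipWithUniqueIdModel(original))
    println(zipWithUniqueIdModel(reordered))
  }
}
```

In this model "almonds" moves from index 0 to 1 under the zipWithIndex scheme and from id 0 to 2 under the zipWithUniqueId scheme when the first partition is reordered, which is consistent with the observation in the thread that zipWithUniqueId is also affected.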