Looks like the two entries for the key (datetime.datetime(2009, 10, 6, 3, 0), 0)
in your first email might have been processed by the x86 node and an x64 node,
respectively.
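
A quick way to confirm from the pyspark shell: hash the same key once per
partition and check whether every node agrees. (The probe key below is
arbitrary; the point is that 32-bit and 64-bit CPython 2 can return
different hash() values for the same string, so the same key can be
routed to different reducers.)

probe = sc.parallelize(range(36), 36)  # roughly one task per partition
hashes = probe.map(lambda _: hash('2009-10-06 03:00:00')).distinct().collect()
print(hashes)  # more than one value means hashing is inconsistent across nodes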

Cheers

On Sat, Apr 18, 2015 at 11:09 AM, SecondDatke <lovejay-lovemu...@outlook.com>
wrote:

> I don't think my experiment is surprising; it's my fault:
>
> To move away from my specific case, I wrote a test program that
> generates data randomly and casts the keys to strings:
>
> import random
> import operator
>
> COUNT = 23333            # number of (key, value) pairs to generate
> COUNT_PARTITIONS = 36    # number of partitions for the RDD
> LEN = 233                # keys are random ints in [1, LEN], cast to str
>
> rdd = sc.parallelize(((str(random.randint(1, LEN)), 1)
>                       for i in xrange(COUNT)), COUNT_PARTITIONS)
> reduced = rdd.reduceByKey(operator.add).sortByKey()
> print(reduced.count(), LEN)  # the result is valid only if count == LEN
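>
> (One workaround I'm considering, if the cause really is hash() differing
> across architectures: force a platform-independent partitioner. md5 is
> just an arbitrary stable choice here, and this assumes your PySpark's
> reduceByKey accepts a partitionFunc argument, which newer releases do --
> check your version's signature:)
>
> import hashlib
>
> def stable_hash(key):
>     # same result on 32- and 64-bit nodes, unlike Python 2's built-in hash()
>     return int(hashlib.md5(str(key)).hexdigest(), 16)
>
> reduced = rdd.reduceByKey(operator.add, COUNT_PARTITIONS, stable_hash)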
>
> More about my environment: I'm running Spark on a small Mesos cluster,
> always via the pyspark shell, with Python 2.7.9 and IPython 3.0.0. The
> operating system is Arch Linux.
>
> And there is one node running 32-bit (x86) Arch Linux, while the others
> run x86_64.
>
> The problem arises whenever the x86 node and the x64 nodes work
> together. Nothing goes wrong if the cluster has only the x86 node, or
> only x64 nodes. And so far only reduceByKey with int32 keys gives
> correct results.
>
> Maybe I should update my system.
>
> ------------------------------
> Date: Sat, 18 Apr 2015 08:28:50 -0700
> Subject: Re: Does reduceByKey only work properly for numeric keys?
> From: yuzhih...@gmail.com
> To: lovejay-lovemu...@outlook.com
> CC: user@spark.apache.org
>
> Can you show us the function you passed to reduceByKey()?
>
> What release of Spark are you using?
>
> Cheers
>
> On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke <lovejay-lovemu...@outlook.com>
> wrote:
>
> I'm trying to solve a word-count-like problem; the difference is that I
> need the count of a specific word within a specific timespan in a social
> message stream.
>
> My data is in the format (time, message), and I transformed it (flatMap
> etc.) into a series of (time, word_id) pairs, where the time is
> represented with Python's datetime.datetime class. I then transformed it
> to ((time, word_id), 1) and used reduceByKey to get the result, roughly
> as sketched below.
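>
> (tokenize here is a stand-in for my real word -> word_id lookup, and
> messages is the (time, message) RDD):
>
> import operator
>
> def tokenize(message):
>     return message.split()  # stand-in for the real word -> word_id lookup
>
> def to_pairs(record):
>     t, message = record
>     hour = t.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
>     return [((hour, word_id), 1) for word_id in tokenize(message)]
>
> counts = messages.flatMap(to_pairs).reduceByKey(operator.add)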
>
> But the dataset returned is a little weird, just like the following:
>
> format:
> ((timespan as datetime.datetime, word_id), freq)
>
> ((datetime.datetime(2009, 10, 6, 2, 0), 0), 8)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 3)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 14)
>
> As you can see, there are DUPLICATED keys, but as a result of
> reduceByKey, all keys SHOULD BE UNIQUE.
>
> I tried converting the key to a string (like '2006-12-02 21:00:00-000')
> and running reduceByKey again, but the problem stayed. The only option
> left seemed to be converting the date to a numeric timestamp, and that
> does work.
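>
> For reference, the timestamp version is roughly as follows
> (calendar.timegm is just one way to turn a UTC datetime into an
> integer; small epoch values hash identically on 32- and 64-bit
> CPython, which I assume is why it works):
>
> import calendar
>
> def to_timestamp_key(t, word_id):
>     # epoch seconds fit in 32 bits for these dates, so hash(key) agrees everywhere
>     return (calendar.timegm(t.timetuple()), word_id)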
>
> Is this expected behavior of reduceByKey (and all other transformations
> that work with keys)?
>
> Currently I'm still working on it.
>
