There was a bug in the devices line: the join key dh.index('id') should have been x[dh.index('id')]. The column index is a constant, so every pair was keyed by 0 instead of by the row's device id, and the join could never match.
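For the archive, here is a minimal Spark-free sketch of that fix. The header list dh and the rows are made-up stand-ins for the real data; only the keying logic is the point:

```python
# Plain-Python sketch of the keying bug; dh and rows are hypothetical
# stand-ins for the real header list and the RDD's rows.
dh = ['id', 'foo', 'bar']
rows = [['dev1', 'f1', 'b1'], ['dev2', 'f2', 'b2']]

# Buggy: dh.index('id') is the constant 0, so every pair gets key 0
# and a join on device id can never line up with the other dataset.
buggy = [(dh.index('id'), {'deviceid': x[dh.index('id')]}) for x in rows]

# Fixed: key each pair by the row's actual id value, x[dh.index('id')].
fixed = [(x[dh.index('id')], {'deviceid': x[dh.index('id')],
                              'foo': x[dh.index('foo')],
                              'bar': x[dh.index('bar')]}) for x in rows]

print([k for k, v in buggy])   # [0, 0]
print([k for k, v in fixed])   # ['dev1', 'dev2']
```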
On Fri, Oct 17, 2014 at 5:52 PM, Russell Jurney <russell.jur...@gmail.com> wrote:

> Is that not exactly what I've done in j3/j4? The keys are identical
> strings. The k is the same; the value in both instances is an associative
> array.
>
> devices = devices.map(lambda x: (dh.index('id'),
>     {'deviceid': x[dh.index('id')],
>      'foo': x[dh.index('foo')],
>      'bar': x[dh.index('bar')]}))
> bytes_in_out = transactions.map(lambda x: (x[th.index('deviceid')],
>     {'deviceid': x[th.index('deviceid')],
>      'foo': x[th.index('foo')],
>      'bar': x[th.index('bar')],
>      'hello': x[th.index('hello')],
>      'world': x[th.index('world')]}))
>
> j3 = bytes_in_out.join(devices, 10)
> j3.take(1)
> j4 = devices.join(bytes_in_out, 10)
> j4.take(1)
>
> On Fri, Oct 17, 2014 at 5:48 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> Hey Russell,
>>
>> join() can only work with RDDs of pairs (key, value), such as
>>
>> rdd1: (k, v1)
>> rdd2: (k, v2)
>>
>> rdd1.join(rdd2) will be (k, (v1, v2))
>>
>> Spark SQL will be more useful for you, see
>> http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
>>
>> Davies
>>
>> On Fri, Oct 17, 2014 at 5:01 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> https://gist.github.com/rjurney/fd5c0110fe7eb686afc9
>>>
>>> Any way I try to join my data fails. I can't figure out what I'm doing
>>> wrong.

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
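The pair-join semantics Davies describes above can be sketched without Spark. This is a toy dict-based inner join, not PySpark itself; the keys and values are invented for illustration, but the output shape (k, (v1, v2)) matches what RDD.join produces:

```python
# Spark-free sketch of joining two keyed datasets, per the thread:
# rdd1 holds (k, v1) pairs, rdd2 holds (k, v2) pairs, and the join
# yields (k, (v1, v2)) for each key present in both.
rdd1 = [('dev1', {'foo': 'f1'}), ('dev2', {'foo': 'f2'})]
rdd2 = [('dev1', {'hello': 'h1'})]

# Build a lookup table from the smaller side, then match keys.
lookup = dict(rdd2)
joined = [(k, (v1, lookup[k])) for k, v1 in rdd1 if k in lookup]

print(joined)  # [('dev1', ({'foo': 'f1'}, {'hello': 'h1'}))]
```

Note that 'dev2' drops out because it has no match in rdd2, just as an inner join on RDDs keeps only keys present on both sides.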