Right now, we cannot figure out which column you referenced in
`select` if there are multiple columns with the same name in the joined
DataFrame (for example, two `value` columns).
A workaround could be:

# Rename `value` on one side so the joined DataFrame has no duplicate names.
numbers2 = numbers.select(numbers.name, numbers.value.alias('other'))
rows = numbers.join(numbers2,
                    (numbers.name == numbers2.name) &
                    (numbers.value != numbers2.other),
                    how="inner") \
    .select(numbers.name, numbers.value, numbers2.other) \
    .collect()

On Mon, Jun 22, 2015 at 12:53 PM, Ignacio Blasco <[email protected]> wrote:
> Sorry thought it was scala/spark
>
> On 22/6/2015 9:49 PM, "Bob Corsaro" <[email protected]> wrote:
>>
>> That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a
>> query here and not actually doing an equality operation.
>>
>> On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco <[email protected]>
>> wrote:
>>>
>>> Probably you should use === instead of == and !== instead of !=
>>>
>>> Can anyone explain why the dataframe API doesn't work as I expect it to
>>> here? It seems like the column identifiers are getting confused.
>>>
>>> https://gist.github.com/dokipen/4b324a7365ae87b7b0e5