Re: Question regarding join with multiple columns with pyspark

2015-08-13 Thread Dan LaBar
The DataFrame issue has been fixed in Spark 1.5. Refer to SPARK-7990 and the Stack Overflow question "Spark specify multiple column conditions for dataframe join".
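
For reference, a minimal sketch of the two forms that express a multi-column join correctly (df1 and df2 are hypothetical DataFrames that both carry yr and mn columns; the list-of-columns form is what SPARK-7990 added):

# Column-expression form: use the overloaded & operator, not Python's
# `and`, and parenthesize each comparison because & binds tighter than ==.
joined = df1.join(df2, (df1.yr == df2.yr) & (df1.mn == df2.mn), "inner")

# List-of-column-names form (Spark 1.5+, per SPARK-7990): an equi-join
# on columns that exist in both DataFrames.
joined = df1.join(df2, ["yr", "mn"], "inner")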

Re: Question regarding join with multiple columns with pyspark

2015-04-28 Thread Ali Bajwa
Thanks again Ayan! To close the loop, I have filed the JIRA below to track the issue:
https://issues.apache.org/jira/browse/SPARK-7197



Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested it, and your observation about the DataFrame API is correct: it behaves weirdly in the case of a multiple-column join. (Maybe we should report a JIRA?)

Solution: you can fall back to the good old composite-key concatenation method. Not ideal, but it is a workaround. (Of course, you can use real SQL as well, as shown below.)

Set up data:

from pyspark.sql import Row

a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,1]]
b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
# Build Rows with a precomputed composite key "yr~mn" in joiningKey.
YM1 = sc.parallelize(a).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
    price=int(tup[2]), joiningKey=str(tup[0]) + "~" + str(tup[1])))
YM2 = sc.parallelize(b).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
    name=str(tup[2]), joiningKey=str(tup[0]) + "~" + str(tup[1])))
print YM1.collect()
print YM2.collect()

# ssc here is a SQLContext, not a StreamingContext.
YM1DF = ssc.createDataFrame(YM1)
YM2DF = ssc.createDataFrame(YM2)

print YM1DF.printSchema()
print YM2DF.printSchema()
This DOES NOT WORK ---

YMJN = YM1DF.join(YM2DF, YM1DF.yr == YM2DF.yr and YM1DF.mn == YM2DF.mn, "inner")
print YMJN.printSchema()
for l in YMJN.collect():
    print l

Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
Row(joiningKey=u'1993~1', mn=1, price=1, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
Row(joiningKey=u'1994~1', mn=1, price=1, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
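
The nine rows are explained by Python rather than by Spark: `and` cannot be overloaded, and `X and Y` simply evaluates to Y when X is truthy, so the condition above silently collapses to YM1DF.mn == YM2DF.mn, a join on mn alone (3 rows x 3 rows = 9). A minimal sketch of the collapse, and of the & operator that Column does overload (note that newer Spark releases raise an error on bool(Column) instead of silently misbehaving):

c1 = YM1DF.yr == YM2DF.yr
c2 = YM1DF.mn == YM2DF.mn
broken = c1 and c2   # evaluates to just c2: plain Python `and`
fixed = c1 & c2      # a real compound Column expression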

-

SQL Solution - works as expected

YM1DF.registerTempTable("ymdf1")
YM2DF.registerTempTable("ymdf2")
YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 on ymdf1.yr = ymdf2.yr and ymdf1.mn = ymdf2.mn")
print YMJNS.printSchema()
for l in YMJNS.collect():
    print l

Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)

-

Field-concat method - works as well

YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey == YM2DF.joiningKey, "inner")
print YMJNA.printSchema()
for l in YMJNA.collect():
    print l

Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
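
A third workaround, sketched here against the same YM1DF/YM2DF frames, is to build the compound condition with & instead of `and`; it needs no extra joiningKey column:

YMJNB = YM1DF.join(YM2DF, (YM1DF.yr == YM2DF.yr) & (YM1DF.mn == YM2DF.mn), "inner")
for l in YMJNB.collect():
    print l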

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread Ali Bajwa
Any ideas on this? Any sample code to join 2 data frames on two columns?

Thanks
Ali

Question regarding join with multiple columns with pyspark

2015-04-23 Thread Ali Bajwa
Hi experts,

Sorry if this is a n00b question or has already been answered...

I am trying to use the DataFrames API in Python to join two DataFrames on more than one column. The example I've seen in the documentation only shows a single column, so I tried this:

Example code

import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
'12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
'value': [101, 102]})
b = hc.createDataFrame(B)

print "Pandas"  # try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()


Output

Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

Empty DataFrame
Columns: [month, value_x, year, value_y]
Index: []

Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

  month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993

It looks like Spark returns some results where an inner join should
return nothing.

Am I doing the join on two columns the wrong way? If so, what is the right syntax?

Thanks!
Ali
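
The four rows above are exactly an inner join on month alone: Python's `and` cannot be overloaded, so a.year == b.year and a.month == b.month reduces to the second comparison (see the replies above). A sketch of the corrected call with the same a and b, which returns the empty frame pandas produces:

print a.join(b, (a.year == b.year) & (a.month == b.month), 'inner').toPandas()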

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org