Re: How to flatten a row in PySpark

ayan guha Thu, 12 Oct 2017 18:28:48 -0700

Quick pyspark code:

>>> from itertools import product
>>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
>>> base = sc.parallelize([s.split("|")])
>>> base.take(10)
[['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5',
'730']]


>>> def pv(t):
...     x = t[3].split(",")
...     y = t[4].split(",")
...     for k in product(x,y):
...         # keep the 3rd column (t[2]) so the output matches the requested format
...         yield (t[0], t[1], t[2], k[0], k[1], t[5])
...
>>> res = base.flatMap(pv)
>>> res.take(10)
[('ABZ', 'ABZ', 'AF', '2', '1', '730'), ('ABZ', 'ABZ', 'AF', '2', '2', '730'),
('ABZ', 'ABZ', 'AF', '2', '3', '730'), ('ABZ', 'ABZ', 'AF', '2', '4', '730'),
('ABZ', 'ABZ', 'AF', '2', '5', '730'), ('ABZ', 'ABZ', 'AF', '3', '1', '730'),
('ABZ', 'ABZ', 'AF', '3', '2', '730'), ('ABZ', 'ABZ', 'AF', '3', '3', '730'),
('ABZ', 'ABZ', 'AF', '3', '4', '730'), ('ABZ', 'ABZ', 'AF', '3', '5', '730')]
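
The result above is one tuple per combination, whereas the original question asks for pipe-delimited lines. A minimal sketch of a generator that goes straight from a raw line to the delimited output follows; the name flatten_line and its sep/inner parameters are made up for illustration. It is plain Python, so it can be checked without a cluster, and the same function could be passed to flatMap on an RDD of raw lines (for instance one read with sc.textFile).

```python
from itertools import product


def flatten_line(line, sep="|", inner=","):
    """Expand one delimited line into one line per (4th, 5th) value pair.

    Hypothetical helper: assumes exactly six fields per line, with the
    4th and 5th fields holding comma-separated values.
    """
    f = line.split(sep)
    # product() pairs every value of the 4th field with every value of the 5th
    for a, b in product(f[3].split(inner), f[4].split(inner)):
        yield sep.join([f[0], f[1], f[2], a, b, f[5]])


s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
lines = list(flatten_line(s))
print(lines[0])    # ABZ|ABZ|AF|2|1|730
print(len(lines))  # 90, i.e. 18 values in the 4th field x 5 in the 5th
```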



On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian <
nicholas.hakob...@rallyhealth.com> wrote:

> Using explode on the 4th column, followed by an explode on the 5th column
> would produce what you want (you might need to use split on the columns
> first if they are not already an array).
>
> Nicholas Szandor Hakobian, Ph.D.
> Staff Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com
>
>
> On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Hi,
>>         Greetings !
>>
>> I have data in the format of the following row:
>>
>> ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730
>>
>> I want to convert it into several rows in the format below:
>>
>> ABZ|ABZ|AF|2|1|730
>> ABZ|ABZ|AF|2|2|730
>> .
>> .
>> .
>> ABZ|ABZ|AF|3|1|730
>> ABZ|ABZ|AF|3|2|730
>> ABZ|ABZ|AF|3|3|730
>> .
>> .
>> .
>> ABZ|ABZ|AF|Y|4|730
>> ABZ|ABZ|AF|Y|5|730
>>
>> Basically, I want to generate every combination of the values in the 4th
>> and 5th columns (which are comma-delimited), producing the rows above from
>> a single input row. Could you please suggest a good way of achieving
>> this? Thanks in advance!
>>
>> Regards,
>>
>> Debu
>>
>
>


-- 
Best Regards,
Ayan Guha
