Quick pyspark code: >>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730" >>> base = sc.parallelize([s.split("|")]) >>> base.take(10) [['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5', '730']]
>>> def pv(t): ... x = t[3].split(",") ... y = t[4].split(",") ... for k in product(x,y): ... yield (t[0],t[1],k[0],k[1],t[5]) ... >>> res = base.flatMap(pv) >>> res.take(10) [('ABZ', 'ABZ', '2', '1', '730'), ('ABZ', 'ABZ', '2', '2', '730'), ('ABZ', 'ABZ', '2', '3', '730'), ('ABZ', 'ABZ', '2', '4', '730'), ('ABZ', 'ABZ', '2', '5', '730'), ('ABZ', 'ABZ', '3', '1', '730'), ('ABZ', 'ABZ', '3', '2', '730'), ('ABZ', 'ABZ', '3', '3', '730'), ('ABZ', 'ABZ', '3', '4', '730'), ('ABZ', 'ABZ', '3', '5', '730')] On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian < nicholas.hakob...@rallyhealth.com> wrote: > Using explode on the 4th column, followed by an explode on the 5th column > would produce what you want (you might need to use split on the columns > first if they are not already an array). > > Nicholas Szandor Hakobian, Ph.D. > Staff Data Scientist > Rally Health > nicholas.hakob...@rallyhealth.com > > > On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <mailford...@gmail.com> > wrote: > >> Hi, >> Greetings ! >> >> I am having data in the format of the following row: >> >> ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730 >> >> I want to convert it into several rows in the format below: >> >> ABZ|ABZ|AF|2|1|730 >> ABZ|ABZ|AF|3+1|730 >> . >> . >> . >> ABZ|ABZ|AF|3|1|730 >> ABZ|ABZ|AF|3|2|730 >> ABZ|ABZ|AF|3|3|730 >> . >> . >> . >> ABZ|ABZ|AF|Y|4|730 >> ABZ|ABZ|AF||Y|5|730 >> >> Basically, I want to consider the various combinations of the 4th and 5th >> columns (where the values are delimited by commas) and accordingly generate >> the above rows from a single row. Please can you suggest me for a good way >> of acheiving this. Thanks in advance ! >> >> Regards, >> >> Debu >> > > -- Best Regards, Ayan Guha