Hi, I am trying to work through an OOM error. I have 10411 files. I want to select a single column from each file and then join them into a single table.
The files have a unique row id; however, it is a very long string. The data file with just the name and the column of interest is about 470 MB. The column of interest alone is 21 MB; it is a column of over 5 million real numbers. So I thought I would save a lot of memory if I could join on row numbers instead.

    from pyspark.sql import Window
    from pyspark.sql.functions import lit, row_number

    # create dummy variable to orderBy https://www.py4u.net/discuss/1840945
    w = Window().orderBy(lit('A'))
    sampleDF = sampleDF.select( ["NumReads"] )\
        .withColumnRenamed( "NumReads", sampleName )\
        .withColumn( "tid", row_number().over(w) )

This code seems pretty complicated to someone coming from pandas and R data frames. My unit test works; however, it generates the following warning:

    WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Is there a better way to create a row number without reordering my data? The order is important.

Kind regards,

Andy
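
One commonly suggested alternative to the single-partition window is RDD.zipWithIndex(), which numbers rows in their existing partition order without moving all data to one partition. The sketch below is an illustration under assumptions, not a tested fix for this job: add_row_index is a hypothetical helper name, df is assumed to be a pyspark.sql.DataFrame, and the index starts at 0 rather than at 1 as row_number() does. Whether the resulting order matches the row order of the source file depends on how the file was read, so that is worth checking on a small sample first.

```python
def add_row_index(df, index_col="tid"):
    """Append a 0-based positional index column to a Spark DataFrame.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame
    created from an active SparkSession.
    """
    # zipWithIndex pairs every Row with its position: (Row, index)
    indexed = df.rdd.zipWithIndex()
    # Flatten each (Row, index) pair into one plain tuple of values
    flat = indexed.map(lambda pair: tuple(pair[0]) + (pair[1],))
    # Rebuild a DataFrame, keeping the original column names
    return flat.toDF(list(df.columns) + [index_col])
```

With the names from the question, usage might look like sampleDF = add_row_index(sampleDF.select("NumReads").withColumnRenamed("NumReads", sampleName)). Note that zipWithIndex has to run an extra Spark job to count the rows per partition when the RDD has more than one partition.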
