On Wed, Aug 29, 2018 at 6:37 AM OrielResearch Eila Arich-Landkof <[email protected]> wrote:
>
> Hello all,
>
> I would like to process a large numpy matrix with dimensions:
> (100K+, 30K+)
> The column names and the row names are meaningful.
>
> My plan was to save the numpy matrix values as a txt file and read it into a PCollection.
Yes, you could do that. Of course, a 3B-entry matrix is going to be a rather large file when written out in decimal; you could consider outputting each row as base64.b64encode(zlib.compress(cPickle.dumps(matrix[row, :]))) and then unpacking on read.

> However, I am not sure how to add the row names to the element for processing.
> The column names are easier - I can pass them as a parameter to the DoFn function, and they are not changing.
>
> With regards to the row names, the only way I could see is to map the row index to a string, read the row number in the DoFn function, and retrieve the name based on it. Is there a more elegant way to solve that?

Is the "row index" one of the columns? (If so, I am assuming this is an integral, or at least numeric, matrix.) If so, then yes, this may work. (How are you storing the row names when you have the original matrix in hand?)

If you go with a custom packing format as above, you could write the text file lines as "rowName,base64data" as well.
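For concreteness, here is a minimal sketch of the packing scheme suggested above: each row is pickled, compressed, base64-encoded, and written as a "rowName,base64data" line, then reversed on read (e.g. inside a DoFn after ReadFromText). The helper names (pack_row, unpack_row, etc.) are illustrative, not from the thread, and Python 3's pickle is used in place of cPickle.

```python
import base64
import pickle
import zlib

import numpy as np

def pack_row(row):
    """Serialize one numpy row to an ASCII-safe string."""
    return base64.b64encode(zlib.compress(pickle.dumps(row))).decode("ascii")

def unpack_row(data):
    """Inverse of pack_row: base64-decode, decompress, and unpickle."""
    return pickle.loads(zlib.decompress(base64.b64decode(data)))

def write_matrix(path, matrix, row_names):
    # One text line per row: "rowName,base64data" keeps each name
    # attached to its data, so no separate index-to-name mapping is needed.
    with open(path, "w") as f:
        for name, row in zip(row_names, matrix):
            f.write("%s,%s\n" % (name, pack_row(row)))

def parse_line(line):
    # This is what a DoFn would do with each element read from the file.
    name, data = line.rstrip("\n").split(",", 1)
    return name, unpack_row(data)
```

The split on the first comma only means row names themselves must not contain commas; the base64 payload is guaranteed comma-free.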
