On Wed, Aug 29, 2018 at 6:37 AM OrielResearch Eila Arich-Landkof <[email protected]> wrote:
>
> Hello all,
>
> I would like to process a large numpy matrix with dimensions:
> (100K+, 30K+)
> The column names and the row names are meaningful.
>
> My plan was to save the numpy matrix values as a txt file and read it into a PCollection.
Yes, you could do that. Of course, a 3B-entry matrix is going to be a rather large file when written out in decimal; you could consider outputting each row as base64.b64encode(zlib.compress(cPickle.dumps(matrix[row, :]))) and then unpacking on read.

> However, I am not sure how to add the row names to the element for processing.
> The column names are easier - I can pass them as a parameter to the DoFn function, and they are not changing.
>
> With regards to the row names, the only way I could see is to map the row index to a string, read the row number in the DoFn function, and retrieve the name based on it. Is there a more elegant way to solve that?

Is the "row index" one of the columns? (If so, I am assuming this is an integral, or at least numeric, matrix.) If so, then yes, this may work. (How are you storing the row names when you have the original matrix in hand?)

If you go with a custom packing format as above, you could write the text file lines as "rowName,base64data" as well.
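For concreteness, here is a minimal sketch of the packing scheme suggested above: each row is pickled, compressed, base64-encoded, and written as a "rowName,base64data" line, then reversed on read (e.g. inside a DoFn after ReadFromText). The helper names (pack_row, unpack_row, etc.) are illustrative, not from the thread, and Python 3's pickle is used in place of cPickle.

```python
import base64
import pickle
import zlib

import numpy as np

def pack_row(row):
    """Serialize one numpy row to an ASCII-safe string."""
    return base64.b64encode(zlib.compress(pickle.dumps(row))).decode("ascii")

def unpack_row(data):
    """Inverse of pack_row: base64-decode, decompress, and unpickle."""
    return pickle.loads(zlib.decompress(base64.b64decode(data)))

def write_matrix(path, matrix, row_names):
    # One text line per row: "rowName,base64data" keeps each name
    # attached to its data, so no separate index-to-name mapping is needed.
    with open(path, "w") as f:
        for name, row in zip(row_names, matrix):
            f.write("%s,%s\n" % (name, pack_row(row)))

def parse_line(line):
    # This is what a DoFn would do with each element read from the file.
    name, data = line.rstrip("\n").split(",", 1)
    return name, unpack_row(data)
```

The split on the first comma only means row names themselves must not contain commas; the base64 payload is guaranteed comma-free.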
