Hello all, I am trying to create a DistributedRowMatrix from my data, which is currently available as text input; each line of the text should become one row of the matrix. I am using the Spectral KMeans code as a way of understanding how DistributedRowMatrix works, and I am somewhat confused. Specifically: does DistributedRowMatrix require that the SequenceFiles have the row ID as the key? (The Spectral KMeans code does this, which is easy there because the first word of each input line carries that information. However, as far as I can see, TextInputFormat only gives me a byte offset as the key. That offset is unique within a file but is not the line number, so I can't recover the line number from my data. Furthermore, suppose I change my data to, say, a bunch of images living in a flat directory: I am thinking of making the key some combination of the file number and this byte offset.)
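To make the mismatch concrete, here is a minimal sketch in plain Java (no Hadoop or Mahout involved; the class and method names are made up for illustration). It assumes UTF-8 text with "\n" line endings, computes the byte offset at which each line starts, which is what I understand TextInputFormat passes to the mapper as the key, and shows that getting a row index back out requires a separate offset-to-rank lookup for the whole file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or Mahout.
class LineOffsets {

    // Byte offset of the start of each line, assuming UTF-8 and '\n' endings.
    static List<Long> lineStartOffsets(String text) {
        List<Long> offsets = new ArrayList<>();
        if (text.isEmpty()) {
            return offsets;
        }
        long offset = 0;
        for (String line : text.split("\n")) {
            offsets.add(offset);
            // +1 accounts for the newline byte that terminates the line.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // Map each byte offset to its rank among all offsets of the same file,
    // i.e. the zero-based line number that could serve as a row ID.
    static Map<Long, Integer> offsetToRow(List<Long> offsets) {
        List<Long> sorted = new ArrayList<>(offsets);
        Collections.sort(sorted);
        Map<Long, Integer> rows = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            rows.put(sorted.get(i), i);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "1 2 3\n4 5\n6\n";
        List<Long> offsets = lineStartOffsets(text);
        Map<Long, Integer> rows = offsetToRow(offsets);
        for (long off : offsets) {
            System.out.println("byte offset " + off + " -> row " + rows.get(off));
        }
    }
}
```

The lookup needs every offset in the file at once, which is exactly what a single mapper does not have when the input is split across mappers, so this only illustrates the problem and is not a distributed solution.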
Thanks
