Reading hdf5 formats with pyspark

2014-07-28 Thread Mohit Singh
Hi, we have set up Spark on an HPC system and are trying to put some data pipelines and algorithms in place. The input data is in HDF5 (these are very high-resolution brain images) and can be read via the h5py library in Python. So, my current approach (which seems to be working) is writing

Re: Reading hdf5 formats with pyspark

2014-07-28 Thread Xiangrui Meng
That looks good to me, since there is no Hadoop InputFormat for HDF5. But remember to specify the number of partitions in `sc.parallelize` so that all the nodes are used. You can change `process` to `read`, which yields records one by one. Then `sc.parallelize(files, numPartitions).flatMap(read)` returns an RDD
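
For illustration, here is a minimal sketch of what that suggestion could look like. It assumes each HDF5 file keeps its records along the first axis of a single dataset. The dataset name `data`, the helper `build_rdd`, and the `(path, index, record)` tuple layout are hypothetical and do not come from the thread; only the `sc.parallelize(files, numPartitions).flatMap(read)` pattern is from the reply above.

```python
def read(path, dataset="data"):
    """Yield (path, index, record) tuples from one HDF5 file, one record at a time.

    Assumption: records are stored along the first axis of the dataset
    named `dataset` (a hypothetical name, adjust to your file layout).
    """
    # Imported inside the function so that each worker process resolves
    # h5py itself when the task runs there.
    import h5py

    with h5py.File(path, "r") as f:
        dset = f[dataset]
        for i in range(dset.shape[0]):
            # Indexing the dataset reads only this record from disk.
            yield (path, i, dset[i])


def build_rdd(sc, files, num_partitions=None):
    """Distribute the file paths, then expand each path into its records.

    `sc` is an existing SparkContext. If `num_partitions` is not given,
    this sketch uses one partition per file (an assumption, not a rule).
    """
    if num_partitions is None:
        num_partitions = max(1, len(files))
    return sc.parallelize(files, num_partitions).flatMap(read)
```

With a live SparkContext this would be called as `rdd = build_rdd(sc, files, numPartitions)`. Because `read` is a generator, a worker does not need to hold a whole file in memory before emitting records. Note that the file paths must be readable from every node, for example on a shared file system.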