Hi Marcelo,

Thanks for the response. I am not sure I understand; can you elaborate a bit? For example, let's look at this tutorial: http://pythonvision.org/basic-tutorial
import mahotas
from scipy import ndimage

dna = mahotas.imread('dna.jpeg')
dnaf = ndimage.gaussian_filter(dna, 8)

But instead of a single dna.jpeg, let's say I have millions of such files and I want to run the above logic on all of them. How should I go about this?

Thanks

On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Hi Jamal,
>
> If what you want is to process lots of files in parallel, the best
> approach is probably to load all file names into an array and
> parallelize that. Then each task will take a path as input and can
> process it however it wants.
>
> Or you could write the file list to a file, and then use sc.textFile()
> to open it (assuming one path per line), and the rest is pretty much
> the same as above.
>
> It will probably be hard to process each individual file in parallel,
> unless mp3 and jpg files can be split into multiple blocks that can be
> processed separately. In that case, you'd need a custom (Hadoop) input
> format that is able to calculate the splits. But it doesn't sound like
> that's what you want.
>
> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> > Hi,
> > How does one process data sources other than text?
> > Let's say I have millions of mp3 (or jpeg) files and I want to use
> > Spark to process them. How does one go about it?
> >
> > I have never been able to figure this out. Let's say I have a library
> > in Python which works like the following:
> >
> > import audio
> >
> > song = audio.read_mp3(filename)
> >
> > Then most of the methods are attached to "song", or there is another
> > function which takes a "song" object as input.
> >
> > Maybe the above is just rambling, but how do I use Spark to process
> > (say) audio files?
> > Thanks
>
> --
> Marcelo
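For what it's worth, Marcelo's first suggestion (parallelize the list of file names, then process one file per task) could be sketched roughly like this. This is only a sketch, not something tested on a real cluster: the directory `/data/dna_images`, the helper names `list_image_paths`, `process_image`, and `run_job`, and the choice of returning the filtered image's mean are all illustrative, and it assumes mahotas and scipy are installed on every worker.

```python
import glob
import os


def list_image_paths(root, pattern="*.jpeg"):
    # Build the list of files to process; this runs on the driver.
    return sorted(glob.glob(os.path.join(root, pattern)))


def process_image(path):
    # Runs on a worker, one call per file. The imports are done inside
    # the function so the function can be shipped to workers and only
    # they need mahotas/scipy available.
    import mahotas
    from scipy import ndimage
    dna = mahotas.imread(path)
    dnaf = ndimage.gaussian_filter(dna, 8)
    # Return something small (here, a per-file summary), not the full
    # filtered array, to keep the data sent back to the driver modest.
    return path, float(dnaf.mean())


def run_job(root):
    # Driver-side wiring: parallelize the paths so each task gets one
    # path as input, exactly as described in the reply above.
    from pyspark import SparkContext
    sc = SparkContext(appName="image-batch")
    paths = list_image_paths(root)  # e.g. "/data/dna_images"
    return sc.parallelize(paths).map(process_image).collect()
```

The sc.textFile() variant from the reply would be almost identical: write one path per line to a file, then replace sc.parallelize(paths) with sc.textFile("paths.txt").map(process_image).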