Phoofff.. (mind blown)... Thank you, sir. This is awesome.
On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> The idea is simple. If you want to run something on a collection of
> files, do (in pseudo-python):
>
>     def processSingleFile(path):
>         # Your code to process a file
>
>     files = ["file1", "file2"]
>     sc.parallelize(files).foreach(processSingleFile)
>
>
> On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>> Hi Marcelo,
>> Thanks for the response. I am not sure I understand; can you elaborate
>> a bit? So, for example, let's take a look at this example:
>> http://pythonvision.org/basic-tutorial
>>
>>     import mahotas
>>     dna = mahotas.imread('dna.jpeg')
>>     dnaf = ndimage.gaussian_filter(dna, 8)
>>
>> But instead of a single dna.jpeg, let's say I have millions of such
>> files and I want to run the above logic on all of them. How should I
>> go about this?
>> Thanks
>>
>> On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>> Hi Jamal,
>>>
>>> If what you want is to process lots of files in parallel, the best
>>> approach is probably to load all the file names into an array and
>>> parallelize that. Then each task will take a path as input and can
>>> process it however it wants.
>>>
>>> Or you could write the file list to a file, and then use sc.textFile()
>>> to open it (assuming one path per line); the rest is pretty much the
>>> same as above.
>>>
>>> It will probably be hard to process each individual file in parallel,
>>> unless mp3 and jpg files can be split into multiple blocks that can be
>>> processed separately. In that case, you'd need a custom (Hadoop) input
>>> format that is able to calculate the splits. But it doesn't sound like
>>> that's what you want.
>>>
>>> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>>>> Hi,
>>>> How does one process data sources other than text? Let's say I have
>>>> millions of mp3 (or jpeg) files and I want to use Spark to process
>>>> them. How does one go about it?
>>>>
>>>> I have never been able to figure this out. Let's say I have this
>>>> library in Python which works like the following:
>>>>
>>>>     import audio
>>>>
>>>>     song = audio.read_mp3(filename)
>>>>
>>>> Then most of the methods are attached to "song", or maybe there is
>>>> another function which takes the "song" type as input.
>>>>
>>>> Maybe the above is just rambling.. but how do I use Spark to process
>>>> (say) audio files?
>>>> Thanks
>>>
>>> --
>>> Marcelo
>
>
> --
> Marcelo
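Marcelo's pseudo-python sketch above can be fleshed out into something runnable. This is a sketch under assumptions: `process_single_file` and the hashing inside it are hypothetical stand-ins for whatever per-file work you actually need (in the thread's image example, the `mahotas.imread` / `gaussian_filter` calls would go there), and the driver section only runs when PySpark is actually installed.

```python
import hashlib

def process_single_file(path):
    """Per-file work executed inside one Spark task.

    Hypothetical stand-in: reads the file and hashes its bytes. In the
    thread's image example, mahotas.imread / ndimage.gaussian_filter
    would run here instead. Each task receives one path as input.
    """
    with open(path, "rb") as f:
        return path, hashlib.md5(f.read()).hexdigest()

try:
    # Only available if a PySpark installation is on the path.
    from pyspark import SparkContext
    HAVE_SPARK = True
except ImportError:
    HAVE_SPARK = False

if HAVE_SPARK:
    sc = SparkContext(appName="per-file-processing")
    files = ["file1.jpeg", "file2.jpeg"]  # placeholder paths
    # Use map() + collect() instead of foreach() if you want the
    # per-file results back on the driver.
    results = sc.parallelize(files).map(process_single_file).collect()
    sc.stop()
```

Note that `process_single_file` must be serializable and self-contained (its imports available on every worker), since Spark ships it to the executors.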
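Marcelo's alternative, writing the file list to a manifest and loading it with sc.textFile(), can be sketched as below. Only "sc.textFile() with one path per line" comes from the thread; the helper names (`write_manifest`, `paths_from_lines`) are made up for illustration.

```python
def write_manifest(paths, manifest_path):
    """Write one file path per line; sc.textFile() hands each line to a task."""
    with open(manifest_path, "w") as f:
        f.write("\n".join(paths) + "\n")

def paths_from_lines(lines):
    """What each Spark task effectively sees: lines of text that are paths.

    Strips whitespace and drops blank lines, since a trailing newline in
    the manifest would otherwise produce an empty "path".
    """
    return [line.strip() for line in lines if line.strip()]

# With a SparkContext `sc` and a per-file function available, the
# distributed version would be roughly:
#   sc.textFile(manifest_path).foreach(process_single_file)
```

One practical difference from parallelizing an in-memory list: the manifest approach lets Spark split a very large path list into partitions without the driver holding it all in one Python list first.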