Phoofff.. (Mind blown)...
Thank you sir.
This is awesome

On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> The idea is simple. If you want to run something on a collection of
> files, do (in pseudo-python):
>
> def processSingleFile(path):
>     # Your code to process a single file goes here
>     pass
>
> files = ["file1", "file2"]
> sc.parallelize(files).foreach(processSingleFile)
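[Editor's note: the pattern above can be sketched locally without a cluster. The snippet below is a hypothetical stand-in that uses Python's concurrent.futures in place of Spark; in a real job, sc.parallelize(files).foreach(processSingleFile) ships each path to an executor instead of a local thread, and processSingleFile would do the actual decoding work.]

```python
from concurrent.futures import ThreadPoolExecutor

def process_single_file(path):
    # Stand-in for the real per-file work (e.g. decode an mp3/jpeg and filter it).
    return "processed " + path

files = ["file1", "file2"]

# Local analogue of sc.parallelize(files).map(process_single_file).collect():
# each path becomes one independent task.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_single_file, files))

print(results)
```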
>
>
> On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> > Hi Marcelo,
> >   Thanks for the response..
> > I am not sure I understand. Can you elaborate a bit?
> > For example, let's take a look at this example:
> > http://pythonvision.org/basic-tutorial
> >
> > import mahotas
> > from scipy import ndimage
> >
> > dna = mahotas.imread('dna.jpeg')
> > dnaf = ndimage.gaussian_filter(dna, 8)
> >
> > But instead of a single dna.jpeg, let's say I have millions of such files
> > and I want to run the above logic on all of them.
> > How should I go about this?
> > Thanks
> >
> > On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>
> >> Hi Jamal,
> >>
> >> If what you want is to process lots of files in parallel, the best
> >> approach is probably to load all file names into an array and
> >> parallelize that. Then each task will take a path as input and can
> >> process it however it wants.
> >>
> >> Or you could write the file list to a file, and then use sc.textFile()
> >> to open it (assuming one path per line), and the rest is pretty much
> >> the same as above.
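[Editor's note: the file-list variant can be sketched the same way. The snippet below is a local simulation, not Spark code: where sc.textFile(list_path) would yield one record per line distributed across the cluster, here the lines are simply read back from a temporary file and processed in a loop.]

```python
import os
import tempfile

def process_single_file(path):
    # Stand-in for the real per-file work.
    return "processed " + path

# Write the file list, one path per line, as suggested above.
paths = ["data/a.mp3", "data/b.mp3"]
list_file = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
list_file.write("\n".join(paths))
list_file.close()

# sc.textFile(list_file.name) would produce one record per line;
# locally we just read the lines back.
with open(list_file.name) as f:
    lines = [line.strip() for line in f if line.strip()]

results = [process_single_file(p) for p in lines]
os.unlink(list_file.name)
print(results)
```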
> >>
> >> It will probably be hard to process each individual file in parallel,
> >> unless mp3 and jpg files can be split into multiple blocks that can be
> >> processed separately. In that case, you'd need a custom (Hadoop) input
> >> format that is able to calculate the splits. But it doesn't sound like
> >> that's what you want.
> >>
> >>
> >>
> >> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> >> > Hi,
> >> >   How does one process data sources other than text?
> >> > Let's say I have millions of mp3 (or jpeg) files and I want to use Spark
> >> > to process them. How does one go about it?
> >> >
> >> >
> >> > I have never been able to figure this out.
> >> > Let's say I have a library in Python that works as follows:
> >> >
> >> > import audio
> >> >
> >> > song = audio.read_mp3(filename)
> >> >
> >> > Most of the methods are then attached to song, or there is another
> >> > function that takes a "song" object as input.
> >> >
> >> > Maybe the above is just rambling, but how do I use Spark to process
> >> > (say) audio files?
> >> > Thanks
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>
