Hi Marcelo,
  Thanks for the response.
I am not sure I understand. Can you elaborate a bit?
For example, let's take a look at this example:
http://pythonvision.org/basic-tutorial

import mahotas
from scipy import ndimage  # the tutorial uses ndimage without importing it

dna = mahotas.imread('dna.jpeg')
dnaf = ndimage.gaussian_filter(dna, 8)

But instead of a single dna.jpeg, let's say I have millions of such files
and I want to run the above logic on all of them.
How should I go about this?
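To make the question concrete, here is a minimal sketch of the pattern Marcelo described: parallelize the list of file names and map the per-file logic over it. The names (`process_file`, the dna_N.jpeg paths) are made up for illustration, and the Spark lines are commented out since they assume a live SparkContext `sc` with mahotas/scipy installed on the workers:

```python
def process_file(path):
    # Stand-in for the real per-file work, e.g.:
    #   import mahotas
    #   from scipy import ndimage
    #   dna = mahotas.imread(path)
    #   return ndimage.gaussian_filter(dna, 8)
    return path.upper()  # placeholder result so the sketch is runnable

# Collect the file names (hypothetical layout; in practice something
# like glob.glob('images/*.jpeg') or a listing of an HDFS directory):
paths = ["dna_%d.jpeg" % i for i in range(3)]

# Plain-Python version of the pattern: map the function over the paths.
results = [process_file(p) for p in paths]

# With Spark, the same logic distributes across the cluster:
#   rdd = sc.parallelize(paths)
#   results = rdd.map(process_file).collect()
```

Each task receives one path, loads that file locally, and returns the result; nothing Spark-specific is needed inside `process_file` itself.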
Thanks

On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Hi Jamal,
>
> If what you want is to process lots of files in parallel, the best
> approach is probably to load all file names into an array and
> parallelize that. Then each task will take a path as input and can
> process it however it wants.
>
> Or you could write the file list to a file, and then use sc.textFile()
> to open it (assuming one path per line), and the rest is pretty much
> the same as above.
>
> It will probably be hard to process each individual file in parallel,
> unless mp3 and jpg files can be split into multiple blocks that can be
> processed separately. In that case, you'd need a custom (Hadoop) input
> format that is able to calculate the splits. But it doesn't sound like
> that's what you want.
>
>
>
> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> > Hi,
> >   How does one process data sources other than text?
> > Let's say I have millions of mp3 (or jpeg) files and I want to use Spark
> > to process them. How does one go about it?
> >
> >
> > I have never been able to figure this out.
> > Let's say I have this library in Python which works like the following:
> >
> > import audio
> >
> > song = audio.read_mp3(filename)
> >
> > Most of the methods are then attached to song, or there is another
> > function that takes a "song" object as input.
> >
> > Maybe the above is just rambling, but how do I use Spark to process
> > (say) audio files?
> > Thanks
>
>
>
> --
> Marcelo
>
