I think you cannot use textFile(), binaryFiles(), or pickleFile() here; a wav file is a different format than those expect.
You could get a list of paths for all the files, then sc.parallelize() and foreach():

def process(path):
    # use subprocess to launch a process to do the job, read the stdout as the result

files = []  # a list of paths of wav files
sc.parallelize(files, len(files)).foreach(process)

On Fri, Jan 16, 2015 at 2:11 PM, Venkat, Ankam <ankam.ven...@centurylink.com> wrote:
> I need to process .wav files in PySpark. If the files are in the local file
> system, I am able to process them. Once I store them on HDFS, I am facing
> issues. For example:
>
> I run a sox program on a wav file like this:
>
>     sox ext2187854_03_27_2014.wav -n stats   <-- works fine
>
>     sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats   <-- does not work, as sox cannot read an HDFS file
>
> So, I do this instead:
>
>     hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats   <-- this works fine
>
> But I am not able to do the same in PySpark:
>
>     wavfile = sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
>     wavfile.pipe(subprocess.call(['sox', '-t', 'wav', '-', '-n', 'stats']))
>
> I tried different options like sc.binaryFiles and sc.pickleFile.
>
> Any thoughts?
>
> Regards,
> Venkat Ankam
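Putting the sketch above together with the working `hadoop fs -cat ... | sox` pipeline from the question, something like the following could work. This is an untested sketch: it assumes `hadoop` and `sox` are on each worker's PATH, and `sox_stats_cmd` is a hypothetical helper name; `map()` is used instead of `foreach()` so the stats come back to the driver.

```python
import subprocess

def sox_stats_cmd(hdfs_path):
    # Hypothetical helper: build the shell pipeline that streams the wav
    # out of HDFS and feeds it to sox on stdin ('-t wav -' reads stdin).
    return "hadoop fs -cat {0} | sox -t wav - -n stats".format(hdfs_path)

def process(path):
    # Run the pipeline on the worker; note sox's 'stats' effect writes
    # its report to stderr, not stdout.
    p = subprocess.Popen(sox_stats_cmd(path), shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return (path, err.decode("utf-8", "replace"))

# On the driver: one partition per file so each file is handled by one task.
# files = ['hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav']
# results = sc.parallelize(files, len(files)).map(process).collect()
```

The point is that each executor shells out to the same `hadoop fs -cat | sox` pipeline you already verified on the command line, so sox never has to understand HDFS paths itself.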