Hi, I understand that now. However, your function foo() should take a string and parse it, rather than trying to read from a file. That way you can separate the file-read part from the parsing part:

r = sc.wholeTextFiles(path)
parsed = r.map(lambda x: (x[0], foo(x[1])))

(Fuller sketches are in the P.S. at the bottom of this mail.)

On Fri, Jun 30, 2017 at 1:25 PM, Saatvik Shah <saatvikshah1...@gmail.com> wrote:

> Hey Ayan,
>
> This isn't a typical text file - it's a proprietary data format for
> which a native Spark reader is not available.
>
> Thanks and Regards,
> Saatvik Shah
>
> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> If your files are in the same location you can use sc.wholeTextFiles.
>> If not, sc.textFile accepts a list of file paths.
>>
>> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <saatvikshah1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a file-reading function called foo which reads the contents of
>>> a file either into a list of lists, or into a generator of lists of
>>> lists representing the same file.
>>>
>>> When reading a file as one complete chunk (a single record array) I do
>>> something like:
>>>
>>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>>
>>> I'd now like to do something similar with the generator, so that I can
>>> use more cores with less memory. I'm not sure how to tackle this:
>>> generators cannot be pickled, so how do I distribute the work of
>>> reading each file path across the RDD?
>>>
>> --
>> Best Regards,
>> Ayan Guha

--
Best Regards,
Ayan Guha
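P.S. Here is a minimal sketch of the split I mean, assuming an existing SparkContext. The body of foo() below is a stand-in (one comma-separated record per line), not your real parser, and the input directory is made up:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def foo(content):
    # Stand-in parser: the real one would decode your proprietary
    # format; here every line becomes one record (a list of fields).
    return [line.split(",") for line in content.splitlines() if line]

# wholeTextFiles yields (path, content) pairs, one per file, so foo
# receives the file's contents as a string and never touches the disk.
pairs = sc.wholeTextFiles("/data/input")    # directory is an assumption
parsed = pairs.mapValues(foo)               # keep the path as the key
records = parsed.flatMap(lambda kv: kv[1])  # one element per record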
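And on the original generator question: a generator object cannot be pickled, but a generator function can. Spark only ships the function to the executors, and flatMap drains the iterator it returns lazily, so you get the lower memory footprint without materialising each file's records as a list. A sketch, with foo_gen again a stand-in for your real parser:

def foo_gen(path):
    # Stand-in generator variant of foo: yields records one at a time
    # instead of building the whole list in memory.
    with open(path) as f:    # runs on the executor, so the path must
        for line in f:       # be readable from every worker node
            yield line.rstrip("\n").split(",")

# file_paths_rdd is the RDD of paths from the original question.
records = file_paths_rdd.flatMap(foo_gen)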