Hi

I understand that now. However, your function foo() should take a string
and parse it, rather than trying to read from a file itself. That way you
separate the file-reading path from the processing logic.

r = sc.wholeTextFiles(path)  # RDD of (file_path, file_content) pairs

parsed = r.map(lambda x: (x[0], foo(x[1])))  # parse the content, keep the path
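
Putting it together, a minimal sketch - the input path and the CSV-style
parsing inside foo are just placeholders for your real format:

from pyspark import SparkContext

sc = SparkContext(appName="parse-example")

def foo(content):
    # Pure parser: takes the raw file content as a string and returns
    # a list of records. No file I/O happens in here.
    return [line.split(",") for line in content.splitlines()]

# wholeTextFiles yields (file_path, file_content) pairs; Spark does the reads
r = sc.wholeTextFiles("/data/input")
parsed = r.map(lambda x: (x[0], foo(x[1])))

Since foo never touches the filesystem, you can unit test it on plain
strings and swap the input source (wholeTextFiles, binaryFiles, etc.)
without changing it.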



On Fri, Jun 30, 2017 at 1:25 PM, Saatvik Shah <saatvikshah1...@gmail.com>
wrote:

> Hey Ayan,
>
> This isn't a typical text file - it's a proprietary data format for which a
> native Spark reader is not available.
>
> Thanks and Regards,
> Saatvik Shah
>
> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> If your files are in the same location you can use sc.wholeTextFiles. If
>> not, sc.textFile accepts a comma-separated string of file paths.
>>
>> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <
>> saatvikshah1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a file-reading function called /foo/ which reads a file's contents
>>> into a list of lists, or into a generator of lists of lists representing
>>> the same file.
>>>
>>> When reading a file as one complete chunk (a single record array) I do
>>> something like:
>>>
>>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>>
>>> I'd now like to do something similar with the generator version, so that
>>> I can use more cores with lower memory. I'm not sure how to tackle this:
>>> since generators cannot be pickled, how do I distribute the work of
>>> reading each file path across the RDD?
>>>
>>>
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
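
On the pickling issue from the question quoted above: only generator
*objects* fail to pickle. A named function that yields records is fine -
Spark serializes the function itself, and each executor calls it to get a
fresh generator. So a generator-style foo works directly with flatMap. A
rough sketch, using the same assumed record format as before:

def foo_gen(path_and_content):
    # Generator version of the parser: yields one record at a time
    # instead of building the full list of lists in memory.
    path, content = path_and_content
    for line in content.splitlines():
        yield line.split(",")

# Spark pickles the function foo_gen, not a generator object, and
# flatMap drains the generator on each executor.
records = sc.wholeTextFiles("/data/input").flatMap(foo_gen)

Note that wholeTextFiles still holds each file's raw content in executor
memory; the generator only avoids materializing the parsed list on top of
it.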


-- 
Best Regards,
Ayan Guha
