Excellent, thank you!
On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Ah, that's unfortunate, that definitely should be added. Using a
> pyspark-internal method, you could try something like
>
>     javaIterator = rdd._jrdd.toLocalIterator()
>     it = rdd._collect_iterator_through_file(javaIterator)
>
> On Fri, Aug 1, 2014 at 3:04 PM, Andrei <faithlessfri...@gmail.com> wrote:
>
>> Thanks, Aaron, it should be fine with partitions (I can repartition it
>> anyway, right?).
>> But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python
>> interface to it?
>> I can get a Java iterator through rdd._jrdd, but it isn't converted to a
>> Python iterator automatically. E.g.:
>>
>> >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>> >>> it = rdd._jrdd.toLocalIterator()
>> >>> next(it)
>> 14/08/02 01:02:32 INFO SparkContext: Starting job: apply at Iterator.scala:371
>> ...
>> 14/08/02 01:02:32 INFO SparkContext: Job finished: apply at Iterator.scala:371, took 0.02064317 s
>> bytearray(b'\x80\x02K\x01.')
>>
>> I understand that the returned byte array somehow corresponds to the
>> actual data, but how can I get at it?
>>
>> On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> rdd.toLocalIterator will do almost what you want, but it requires that
>>> each individual partition fits in memory (rather than each individual
>>> line). Hopefully that's sufficient, though.
>>>
>>> On Fri, Aug 1, 2014 at 1:38 AM, Andrei <faithlessfri...@gmail.com> wrote:
>>>
>>>> Is there a way to get an iterator from an RDD? Something like
>>>> rdd.collect(), but returning a lazy sequence rather than a single array.
>>>>
>>>> Context: I need to gzip processed data to upload it to Amazon S3. Since
>>>> the archive should be a single file, I want to iterate over the RDD,
>>>> writing each line to a local .gz file. The file is small enough to fit
>>>> on local disk, but still large enough not to fit into memory.
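
For the archive, here is a minimal, untested sketch of how the suggested
workaround could be wired into the gzip use case. It leans on the
pyspark-internal _jrdd and _collect_iterator_through_file, so it may break in
later releases; the output path and the assumption that the RDD holds plain
text lines are only examples. (The bytearray shown in the session above is
just a record in its serialized form; _collect_iterator_through_file runs it
back through the RDD's deserializer.)

    import gzip

    # rdd is assumed to be an RDD of plain text lines (Python 2 / Spark 1.0.x era).
    java_iter = rdd._jrdd.toLocalIterator()                  # Java-side iterator over serialized records
    py_iter = rdd._collect_iterator_through_file(java_iter)  # yields Python objects without building one big in-memory list

    # Stream the records straight into a local gzip file.
    out = gzip.open('/tmp/output.gz', 'wb')                  # example path only
    try:
        for line in py_iter:
            out.write(line + '\n')
    finally:
        out.close()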