1. load 3 matrices of size ~ 10000 X 10000 using numpy.
2. rdd2 = rdd1.values().flatMap( fun )  # rdd1 has roughly 10^7 tuples
3. df = sqlCtx.createDataFrame(rdd2)
4. df.save() # in parquet format
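
Roughly, the above looks like the sketch below (not the exact code; the
matrix file names, the flatMap function fun, the way rdd1 is built from the
matrices, and the output path are all placeholders):

    import numpy as np
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlCtx = SQLContext(sc)

    # 1. load 3 matrices of size ~10000 x 10000
    m1, m2, m3 = (np.load(f) for f in ("m1.npy", "m2.npy", "m3.npy"))

    # rdd1 holds roughly 10^7 (key, value) tuples built from the matrices
    # (build_tuples is a hypothetical helper; details omitted here)
    rdd1 = sc.parallelize(build_tuples(m1, m2, m3))

    # 2. flatMap over the values with my own function fun
    rdd2 = rdd1.values().flatMap(fun)

    # 3. build the DataFrame; this is the call that throws
    df = sqlCtx.createDataFrame(rdd2)

    # 4. write out in parquet format
    df.save("output.parquet", source="parquet")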

It throws the exception in the createDataFrame() call. I don't know what
exactly it is creating: is everything held in memory? Or can I make it
persist to disk while it is being created?

Thanks


On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Can you paste the code? How much memory does your system have and how big
> is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
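>
> (A minimal sketch of that suggestion, assuming an existing DataFrame df;
> the output path is only an example:)
>
>     from pyspark import StorageLevel
>
>     # keep partitions that don't fit in memory on disk instead of failing
>     df.persist(StorageLevel.MEMORY_AND_DISK)
>     df.save("output.parquet", source="parquet")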
>
> Thanks
> Best Regards
>
> On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma <
> harit.vishwaka...@gmail.com> wrote:
>
>> Thanks,
>> Code is running on a single machine.
>> And it still doesn't answer my question.
>>
>> On Fri, Jul 17, 2015 at 4:52 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> You can bump up the number of partitions while creating the RDD you are
>>> using for the DataFrame.
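>>>
>>> (A rough sketch of what I mean, assuming the RDD is created with
>>> sc.parallelize; the partition count is only an example:)
>>>
>>>     # create the RDD with more partitions up front
>>>     rdd1 = sc.parallelize(data, numSlices=1000)
>>>
>>>     # or repartition an existing RDD before building the DataFrame
>>>     df = sqlCtx.createDataFrame(rdd2.repartition(1000))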
>>> On 17 Jul 2015 21:03, "Harit Vishwakarma" <harit.vishwaka...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I used the createDataFrame API of SQLContext in Python and am getting an
>>>> OutOfMemoryException. I am wondering whether it creates the whole
>>>> DataFrame in memory?
>>>> I did not find any documentation describing the memory usage of the Spark
>>>> APIs. The documentation is nice, but a little more detail (especially on
>>>> memory usage, data distribution, etc.) would really help.
>>>>
>>>> --
>>>> Regards
>>>> Harit Vishwakarma
>>>>
>>>>
>>
>>
>> --
>> Regards
>> Harit Vishwakarma
>>
>>
>


-- 
Regards
Harit Vishwakarma
