1. Load 3 matrices of size ~10000 x 10000 using numpy.
2. rdd2 = rdd1.values().flatMap(fun)  # rdd1 has roughly 10^7 tuples
3. df = sqlCtx.createDataFrame(rdd2)
4. df.save()  # in parquet format
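For concreteness, here is a minimal runnable sketch of these four steps on a local PySpark 1.x setup. The matrix contents, the layout of rdd1, and the expansion function fun are hypothetical stand-ins, since the original message does not show them:

    # Minimal sketch of steps 1-4 (PySpark 1.x API). The matrices, the
    # contents of rdd1, and `fun` are stand-ins; the post does not show them.
    import numpy as np
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[*]", "createDataFrame-oom")
    sqlCtx = SQLContext(sc)

    # Step 1: the real run loads 3 matrices of ~10000 x 10000 with numpy
    # (e.g. np.load); small random stand-ins keep this sketch runnable.
    m1 = np.random.rand(100, 100)
    m2 = np.random.rand(100, 100)
    m3 = np.random.rand(100, 100)

    # rdd1: (key, value) tuples; roughly 10^7 of them in the real run.
    rdd1 = sc.parallelize([(i, i) for i in range(100000)])

    def fun(v):
        # Stand-in for the real expansion function.
        return [(v, v * 2)]

    # Step 2: drop the keys, expand each value into zero or more rows.
    rdd2 = rdd1.values().flatMap(fun)

    # Step 3: build a DataFrame; column types are inferred from the tuples.
    df = sqlCtx.createDataFrame(rdd2, ["id", "doubled"])

    # Step 4: write out as Parquet (the default source in Spark 1.x).
    df.save("out.parquet")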
The exception is thrown in the createDataFrame() call (step 3). I don't know what exactly it is creating. Is everything built in memory, or can I make it persist to disk while it is being created?

Thanks

On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Can you paste the code? How much memory does your system have, and how
> big is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
>
> Thanks
> Best Regards
>
> On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma
> <harit.vishwaka...@gmail.com> wrote:
>
>> Thanks. The code is running on a single machine, and this still doesn't
>> answer my question.
>>
>> On Fri, Jul 17, 2015 at 4:52 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> You can bump up the number of partitions while creating the RDD you
>>> are using for the df.
>>>
>>> On 17 Jul 2015 21:03, "Harit Vishwakarma"
>>> <harit.vishwaka...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I used the createDataFrame API of SQLContext in Python and am getting
>>>> an OutOfMemoryException. I am wondering whether it creates the whole
>>>> DataFrame in memory.
>>>> I did not find any documentation describing the memory usage of the
>>>> Spark APIs. The documentation given is nice, but a little more detail
>>>> (especially on memory usage, data distribution, etc.) would really
>>>> help.

--
Regards
Harit Vishwakarma
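Taken together, the thread's two suggestions (raising the partition count when building the RDD, and persisting with MEMORY_AND_DISK so partitions can spill to local disk instead of failing with OOM) could be applied to the pipeline roughly as below. This continues from the earlier sketch; the partition count of 200 and the output path are illustrative, not tuned values:

    # Apply the thread's suggestions to the pipeline from the first message.
    from pyspark import StorageLevel

    rdd2 = (rdd1.values()
                .flatMap(fun)
                .repartition(200)                        # more, smaller partitions
                .persist(StorageLevel.MEMORY_AND_DISK))  # spill instead of OOM

    df = sqlCtx.createDataFrame(rdd2, ["id", "doubled"])
    df.persist(StorageLevel.MEMORY_AND_DISK)             # as Akhil suggested
    df.save("out_spill.parquet")

As far as I can tell, these transformations are lazy: aside from a small sampling job that createDataFrame may run to infer column types, nothing is materialized until df.save() triggers the write, at which point MEMORY_AND_DISK lets partitions that do not fit in memory be stored on disk rather than dropped.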