Re: How to estimate the size of dataframe using pyspark?
Thanks Davies, I've shared the code snippet and the dataset. Please let me
know if you need any other information.

On Mon, Apr 11, 2016 at 10:44 AM, Davies Liu wrote:
> That's weird, DataFrame.count() should not require lots of memory on the
> driver. Could you provide a way to reproduce it (perhaps by generating a
> fake dataset)?
Re: How to estimate the size of dataframe using pyspark?
That's weird, DataFrame.count() should not require lots of memory on the
driver. Could you provide a way to reproduce it (perhaps by generating a
fake dataset)?

On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev wrote:
> I've allocated about 4g for the driver. For the count stage, I notice the
> Shuffle Write to be 13.9 GB.
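A hedged sketch of the kind of fake dataset being asked for here: write rows of 3 random INT columns to a CSV file, which could then be loaded and counted in PySpark. The file name, column names, and row count are all made up for illustration.

```python
import csv
import random

def write_fake_dataset(path, n_rows, seed=0):
    """Write a reproducible fake dataset with 3 INT columns, as in the thread."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["a", "b", "c"])  # hypothetical column names
        for _ in range(n_rows):
            w.writerow([rng.randint(0, 10**6) for _ in range(3)])

write_fake_dataset("fake.csv", 1000)

# Then, in PySpark (not run here):
#   df = spark.read.csv("fake.csv", header=True, inferSchema=True)
#   df.count()
```

Scaling `n_rows` up (and splitting across multiple files) would be needed to approach the multi-GB shuffle sizes reported later in the thread.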
Re: How to estimate the size of dataframe using pyspark?
I've allocated about 4g for the driver. For the count stage, I notice the
Shuffle Write to be 13.9 GB.

On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR wrote:
> What's the size of your driver?
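For reference, the driver allocation mentioned above is usually set at submit time. An illustrative invocation (the 8g value is an assumption for this example, not a recommendation):

```shell
# Raise driver memory when submitting a job (my_job.py is hypothetical):
spark-submit --driver-memory 8g my_job.py

# Or for an interactive PySpark shell:
pyspark --driver-memory 8g
```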
Re: How to estimate the size of dataframe using pyspark?
Thanks Mandar, I couldn't see anything under the 'Storage' section, but under
the Executors tab I noticed it to be 3.1 GB:

Executors (1)
Memory: 0.0 B Used (3.1 GB Total)

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-estimate-the-size-of-dataframe-using-pyspark-tp26729p26732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
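A likely reason nothing appears under the Storage tab: a DataFrame only shows up there after it has been both marked for caching and materialized by an action. A minimal sketch (where `df` stands for any Spark DataFrame):

```python
def materialize_cache(df):
    """Cache the DataFrame and force evaluation so the Spark UI's
    Storage tab reports its in-memory size."""
    df.cache()         # marks the DataFrame for caching (lazy, does nothing yet)
    return df.count()  # an action, which actually populates the cache
```

After this, the cached size should be visible under Storage rather than only as aggregate executor memory.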
Re: How to estimate the size of dataframe using pyspark?
What's the size of your driver?

On Sat, 9 Apr 2016 at 20:33, Buntu Dev wrote:
> Actually, df.show() works, displaying 20 rows, but df.count() is the one
> that causes the driver to run out of memory. There are just 3 INT columns.
Re: How to estimate the size of dataframe using pyspark?
Actually, df.show() works, displaying 20 rows, but df.count() is the one
that causes the driver to run out of memory. There are just 3 INT columns.

Any idea what could be the reason?

On Sat, Apr 9, 2016 at 10:47 AM, Ardo wrote:
> You seem to have a lot of columns :-)!
> df.count() gives the number of rows of your data frame, and
> len(df.columns) the number of columns.
Re: How to estimate the size of dataframe using pyspark?
You seem to have a lot of columns :-)!

df.count() gives the number of rows of your data frame, and len(df.columns)
the number of columns (in PySpark, df.columns is a plain Python list).

Finally, I suggest you check the size of your driver and customize it
accordingly.

Cheers,

Ardo

Sent from my iPhone

On 09 Apr 2016, at 19:37, bdev wrote:
> I keep running out of memory on the driver when I attempt to do df.show().
> Can anyone let me know how to estimate the size of the dataframe?
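To "customize it accordingly" you first need a size estimate. A back-of-envelope sketch, where the per-type byte widths are rough assumptions (raw column widths only; Spark's actual in-memory and shuffle footprint adds per-row overhead, so real sizes can be several times larger):

```python
# Rough lower-bound estimate of a DataFrame's data size from its schema.
TYPE_BYTES = {"int": 4, "bigint": 8, "float": 4, "double": 8, "boolean": 1}

def estimate_bytes(row_count, column_types):
    """row_count x sum of per-column widths, ignoring all overhead."""
    return row_count * sum(TYPE_BYTES[t] for t in column_types)

# The thread's case: 3 INT columns. A billion rows of raw data would be:
print(estimate_bytes(1_000_000_000, ["int", "int", "int"]))  # 12000000000 (~12 GB)
```

With a live DataFrame, `row_count` would come from df.count() and the type names from df.dtypes; a 13.9 GB shuffle write over 3 INT columns is consistent with a row count in the billions.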
How to estimate the size of dataframe using pyspark?
I keep running out of memory on the driver when I attempt to do df.show().
Can anyone let me know how to estimate the size of the dataframe?

Thanks!

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-estimate-the-size-of-dataframe-using-pyspark-tp26729.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org