Dataframe's storage size

2021-12-23 Thread bitfox

Hello

Is it possible to get a DataFrame's total storage size in bytes? Something
like:



df.size()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1660, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'size'

Sure, it won't work, but if there were such a method, that would be great.

Thanks.
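There is no built-in DataFrame.size() in PySpark, but two rough workarounds are
sketched below. The first goes through df._jdf and the Catalyst query plan,
which are internal, unsupported APIs, and it only returns Catalyst's size
estimate; the second writes the data out and sums the resulting file sizes (the
path and format are placeholders).

# Sketch 1: Catalyst's size estimate (internal API, estimate only).
size_estimate = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print("Catalyst size estimate (bytes):", size_estimate.toString())

# Sketch 2: materialize the data in a concrete format and sum the file sizes.
# Assumes a local filesystem path; on HDFS/S3, list the files with the
# corresponding filesystem client instead.
import glob
import os

df.write.mode("overwrite").parquet("/tmp/df_size_check")
total_bytes = sum(os.path.getsize(f) for f in glob.glob("/tmp/df_size_check/*.parquet"))
print("On-disk Parquet size (bytes):", total_bytes)

Note that the on-disk number depends on the format and compression used, so it
measures that particular representation rather than a single "true" size.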




Re: measure running time

2021-12-23 Thread bitfox

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't import from the sparkmeasure module:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.
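One likely cause, sketched below: the --packages flag only downloads the JVM
artifact into the Spark session, while the sparkmeasure Python wrapper is
published separately on PyPI and has to be installed into the Python
environment that pyspark uses (an assumption about this particular setup, but
it matches the install steps in the sparkMeasure documentation).

# Shell step (shown as a comment): install the Python wrapper
#     pip install sparkmeasure
#
# Then, inside a pyspark session started with
#     pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
from sparkmeasure import StageMetrics
print(StageMetrics)   # should now import without ModuleNotFoundError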


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try the tools you provided on
GitHub.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, look at the Spark UI to see how timings are calculated in
distributed computing mode; there are several well-written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In PySpark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the DataFrame
API, as in this blog post of mine:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.






Re: measure running time

2021-12-23 Thread bitfox
Thanks Gourav and Luca. I will try the tools you provided on
GitHub.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, look at the Spark UI to see how timings are calculated in
distributed computing mode; there are several well-written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In PySpark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the DataFrame
API, as in this blog post of mine:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.






Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Thanks Luca,

I am still getting an error:


pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

Python 3.7.3 (default, Mar 27 2019, 22:11:17)

[GCC 7.3.0] :: Anaconda, Inc. on linux

Type "help", "copyright", "credits" or "license" for more information.

:: loading settings :: url =
jar:file:/d4T/hduser/spark-3.1.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml

Ivy Default Cache set to: /home/hduser/.ivy2/cache

The jars for the packages stored in: /home/hduser/.ivy2/jars

ch.cern.sparkmeasure#spark-measure_2.12 added as a dependency

:: resolving dependencies ::
org.apache.spark#spark-submit-parent-8175aada-8494-4953-a687-9c95b282c751;1.0

confs: [default]

found ch.cern.sparkmeasure#spark-measure_2.12;0.17 in central

found com.fasterxml.jackson.module#jackson-module-scala_2.12;2.9.9
in central

found com.fasterxml.jackson.core#jackson-core;2.9.9 in central

found com.fasterxml.jackson.core#jackson-annotations;2.9.9 in
central

found com.fasterxml.jackson.core#jackson-databind;2.9.9 in central

found com.fasterxml.jackson.module#jackson-module-paranamer;2.9.9
in central

found com.thoughtworks.paranamer#paranamer;2.8 in spark-list

found org.slf4j#slf4j-api;1.7.26 in central

found org.influxdb#influxdb-java;2.14 in central

found com.squareup.retrofit2#retrofit;2.4.0 in central

found com.squareup.retrofit2#converter-moshi;2.4.0 in central

found com.squareup.moshi#moshi;1.5.0 in central

found com.squareup.okio#okio;1.13.0 in central

found org.msgpack#msgpack-core;0.8.16 in central

found com.squareup.okhttp3#okhttp;3.11.0 in local-m2-cache

found com.squareup.okio#okio;1.14.0 in local-m2-cache

found com.squareup.okhttp3#logging-interceptor;3.11.0 in central

:: resolution report :: resolve 5349ms :: artifacts dl 2ms

:: modules in use:

ch.cern.sparkmeasure#spark-measure_2.12;0.17 from central in
[default]

com.fasterxml.jackson.core#jackson-annotations;2.9.9 from central
in [default]

com.fasterxml.jackson.core#jackson-core;2.9.9 from central in
[default]

com.fasterxml.jackson.core#jackson-databind;2.9.9 from central in
[default]

com.fasterxml.jackson.module#jackson-module-paranamer;2.9.9 from
central in [default]

com.fasterxml.jackson.module#jackson-module-scala_2.12;2.9.9 from
central in [default]

com.squareup.moshi#moshi;1.5.0 from central in [default]

com.squareup.okhttp3#logging-interceptor;3.11.0 from central in
[default]

com.squareup.okhttp3#okhttp;3.11.0 from local-m2-cache in [default]

com.squareup.okio#okio;1.14.0 from local-m2-cache in [default]

com.squareup.retrofit2#converter-moshi;2.4.0 from central in
[default]

com.squareup.retrofit2#retrofit;2.4.0 from central in [default]

com.thoughtworks.paranamer#paranamer;2.8 from spark-list in
[default]

org.influxdb#influxdb-java;2.14 from central in [default]

org.msgpack#msgpack-core;0.8.16 from central in [default]

org.slf4j#slf4j-api;1.7.26 from central in [default]

:: evicted modules:

com.fasterxml.jackson.core#jackson-annotations;2.9.0 by
[com.fasterxml.jackson.core#jackson-annotations;2.9.9] in [default]

com.squareup.okhttp3#okhttp;3.10.0 by
[com.squareup.okhttp3#okhttp;3.11.0] in [default]

com.squareup.okio#okio;1.13.0 by [com.squareup.okio#okio;1.14.0] in
[default]


---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   19  |   11  |   11  |   3   ||   16  |   0   |
---------------------------------------------------------------------


:: problems summary ::

 ERRORS

unknown resolver null


SERVER ERROR: Bad Gateway url=
https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-bom/2.9.9/jackson-bom-2.9.9.jar


SERVER ERROR: Bad Gateway url=
https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-base/2.9.9/jackson-base-2.9.9.jar


unknown resolver null


I will try to investigate it



Cheers






*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 23 Dec 2021 at 19:46, Luca Canali  wrote:

> Hi Mich,
>
>
>
> With Spark 3.1.1 you need to use 

RE: measure running time

2021-12-23 Thread Luca Canali
Hi Mich,

 

With Spark 3.1.1 you need to use spark-measure built with Scala 2.12:  

 

bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

 

Best,

Luca
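Once pyspark starts cleanly with the Scala 2.12 artifact, a minimal measurement
looks roughly like the sketch below. Method names (begin, end, print_report)
follow the examples in the sparkMeasure README, and the spark.range query is
just a stand-in workload; treat both as assumptions if your version differs.

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)   # 'spark' is the SparkSession created by pyspark

stagemetrics.begin()
spark.range(1000 * 1000).selectExpr("sum(id)").show()   # workload to measure
stagemetrics.end()
stagemetrics.print_report()          # aggregated stage metrics, including elapsed time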

 

From: Mich Talebzadeh  
Sent: Thursday, December 23, 2021 19:59
To: Luca Canali 
Cc: user 
Subject: Re: measure running time

 

Hi Luca,

 

Have you tested this link  https://github.com/LucaCanali/sparkMeasure

 

With Spark 3.1.1/PySpark, I am getting this error:

 

 

pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17

 

:: problems summary ::

 ERRORS

unknown resolver null

 

SERVER ERROR: Bad Gateway 
url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-bom/2.9.9/jackson-bom-2.9.9.jar

 

SERVER ERROR: Bad Gateway 
url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-base/2.9.9/jackson-base-2.9.9.jar

 

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)

Spark context Web UI available at http://rhes76:4040

Spark context available as 'sc' (master = local[*], app id = 
local-1640285629478).

SparkSession available as 'spark'.

 

>>> from sparkmeasure import StageMetrics

>>> stagemetrics = StageMetrics(spark)

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/home/hduser/anaconda3/envs/pyspark_venv/lib/python3.7/site-packages/sparkmeasure/stagemetrics.py",
 line 15, in __init__

self.stagemetrics = 
self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)

  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1569, in __call__

  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco

return f(*a, **kw)

  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, 
in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling 
None.ch.cern.sparkmeasure.StageMetrics.

: java.lang.NoClassDefFoundError: scala/Product$class

at ch.cern.sparkmeasure.StageMetrics.<init>(stagemetrics.scala:111)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:238)

at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)

at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)

at py4j.GatewayConnection.run(GatewayConnection.java:238)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.ClassNotFoundException: scala.Product$class

at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 12 more

 

Thanks

 

 

 

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

 

 

On Thu, 23 Dec 2021 at 15:41, Luca Canali <luca.can...@cern.ch> wrote:

Hi,

 

I agree with Gourav that just measuring execution time is a simplistic approach 
that may lead you to miss important details, in particular when running 
distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite useful for 
further drill down. See https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating collecting 
and aggregating some executor task metrics: 
https://github.com/LucaCanali/sparkMeasure

 

Best,

Luca

 

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user <user@spark.apache.org>
Subject: Re: measure running time

 

Hi,

 

I do not think that such time comparisons make any sense at all in distributed 
computation. Just saying that an operation in RDD and Dataframe can be compared 
based on their start and stop time may not provide any valid information.

 

You will have to look into the details of timing and the steps. For example,
look at the Spark UI to see how timings are calculated in distributed
computing mode; there are several well-written papers on this.

 

 

Thanks and Regards,

Gourav Sengupta

 

 

 

 

 

On Thu, Dec 

Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Hi Luca,

Have you tested this link  https://github.com/LucaCanali/sparkMeasure


With Spark 3.1.1/PySpark, I am getting this error:



pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17

:: problems summary ::

 ERRORS

unknown resolver null


SERVER ERROR: Bad Gateway url=
https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-bom/2.9.9/jackson-bom-2.9.9.jar


SERVER ERROR: Bad Gateway url=
https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-base/2.9.9/jackson-base-2.9.9.jar


Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)

Spark context Web UI available at http://rhes76:4040

Spark context available as 'sc' (master = local[*], app id =
local-1640285629478).

SparkSession available as 'spark'.

>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
Traceback (most recent call last):
  File "", line 1, in 
  File
"/home/hduser/anaconda3/envs/pyspark_venv/lib/python3.7/site-packages/sparkmeasure/stagemetrics.py",
line 15, in __init__
self.stagemetrics =
self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
line 1569, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line
328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
None.ch.cern.sparkmeasure.StageMetrics.
: java.lang.NoClassDefFoundError: scala/Product$class
at ch.cern.sparkmeasure.StageMetrics.<init>(stagemetrics.scala:111)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more

Thanks







*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 23 Dec 2021 at 15:41, Luca Canali  wrote:

> Hi,
>
>
>
> I agree with Gourav that just measuring execution time is a simplistic
> approach that may lead you to miss important details, in particular when
> running distributed computations.
>
> WebUI, REST API, and metrics instrumentation in Spark can be quite useful
> for further drill down. See
> https://spark.apache.org/docs/latest/monitoring.html
>
> You can also have a look at this tool that takes care of automating
> collecting and aggregating some executor task metrics:
> https://github.com/LucaCanali/sparkMeasure
>
>
>
> Best,
>
> Luca
>
>
>
> *From:* Gourav Sengupta 
> *Sent:* Thursday, December 23, 2021 14:23
> *To:* bit...@bitfox.top
> *Cc:* user 
> *Subject:* Re: measure running time
>
>
>
> Hi,
>
>
>
> I do not think that such time comparisons make any sense at all in
> distributed computation. Just saying that an operation in RDD and Dataframe
> can be compared based on their start and stop time may not provide any
> valid information.
>
>
>
> You will have to look into the details of timing and the steps. For
> example, look at the Spark UI to see how timings are calculated in
> distributed computing mode; there are several well-written papers on this.
>
>
>
>
>
> Thanks and Regards,
>
> Gourav Sengupta
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Dec 23, 2021 at 10:57 AM  wrote:
>
> hello community,
>
> In PySpark, how can I measure the running time of a command?
> I just want to compare the running time of the RDD API and the DataFrame
> API, as in this blog post of mine:
>
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>
> I tried spark.time() but it doesn't work.
> Thank you.
>
> 

Re: About some Spark technical help

2021-12-23 Thread sam smith
Hi Andrew,

Thanks, here's the GitHub repo with the code and the publication:
https://github.com/SamSmithDevs10/paperReplicationForReview

Kind regards

Le jeu. 23 déc. 2021 à 17:58, Andrew Davidson  a écrit :

> Hi Sam
>
>
>
> Can you tell us more? What is the algorithm? Can you send us the URL of the
> publication?
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> *From: *sam smith 
> *Date: *Wednesday, December 22, 2021 at 10:59 AM
> *To: *"user@spark.apache.org" 
> *Subject: *About some Spark technical help
>
>
>
> Hello guys,
>
>
>
> I am replicating a paper's algorithm in Spark / Java, and want to ask you
> guys for some assistance to validate / review about 150 lines of code. My
> GitHub repo contains both my Java class and the related paper.
>
> Any interested reviewers here?
>
>
>
>
>
> Thanks.
>


Re: About some Spark technical help

2021-12-23 Thread Andrew Davidson
Hi Sam

Can you tell us more? What is the algorithm? Can you send us the URL of the
publication?

Kind regards

Andy

From: sam smith 
Date: Wednesday, December 22, 2021 at 10:59 AM
To: "user@spark.apache.org" 
Subject: About some Spark technical help

Hello guys,

I am replicating a paper's algorithm in Spark / Java, and want to ask you guys
for some assistance to validate / review about 150 lines of code. My GitHub
repo contains both my Java class and the related paper.

Any interested reviewers here?


Thanks.


Re: How to estimate the executor memory size according by the data

2021-12-23 Thread Gourav Sengupta
Hi,

just trying to understand:
1. Are you using JDBC to consume data from Hive?
2. Or are you reading data directly from S3 and using the Hive Metastore
in Spark only to find out where the table is stored and its metadata?

Regards,
Gourav Sengupta

On Thu, Dec 23, 2021 at 2:13 PM Arthur Li  wrote:

> Dear experts,
>
> Recently there have been some OOM issues in my demo jobs, which consume data
> from the Hive database. I know I can increase the executor memory size to
> eliminate the OOM error, but I don't know how to assess the executor memory
> requirement or how to automatically adapt the executor memory size to the
> data size.
>
> Any advice is appreciated.
> Arthur Li
>
>
>


RE: measure running time

2021-12-23 Thread Luca Canali
Hi,

 

I agree with Gourav that just measuring execution time is a simplistic approach 
that may lead you to miss important details, in particular when running 
distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite useful for 
further drill down. See https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating collecting 
and aggregating some executor task metrics: 
https://github.com/LucaCanali/sparkMeasure

 

Best,

Luca

 

From: Gourav Sengupta  
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

 

Hi,

 

I do not think that such time comparisons make any sense at all in distributed 
computation. Just saying that an operation in RDD and Dataframe can be compared 
based on their start and stop time may not provide any valid information.

 

You will have to look into the details of timing and the steps. For example,
look at the Spark UI to see how timings are calculated in distributed
computing mode; there are several well-written papers on this.

 

 

Thanks and Regards,

Gourav Sengupta

 

 

 

 

 

On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:

hello community,

In PySpark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the DataFrame
API, as in this blog post of mine:
https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.




RE: How to estimate the executor memory size according by the data

2021-12-23 Thread Luca Canali
Hi Arthur,

If you are using Spark 3.x you can use executor metrics for memory
instrumentation.
Metrics are available on the WebUI, see
https://spark.apache.org/docs/latest/web-ui.html#stage-detail (search for Peak
execution memory).
Executor memory metrics are also available in the REST API and the Spark
metrics system, see https://spark.apache.org/docs/latest/monitoring.html
Further information on the topic is available at
https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring
  
Best,
Luca
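For scripted access, a minimal sketch of reading the executor peak memory
metrics through the REST API is shown below. It assumes Spark 3.x, a driver UI
reachable on localhost:4040, and the /applications/<app-id>/executors endpoint
with peakMemoryMetrics fields as described in the monitoring docs; field names
should be checked against your Spark version.

import requests

base = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base}/applications").json()[0]["id"]

for executor in requests.get(f"{base}/applications/{app_id}/executors").json():
    peak = executor.get("peakMemoryMetrics", {})
    print(executor["id"],
          "JVMHeapMemory:", peak.get("JVMHeapMemory"),
          "OnHeapExecutionMemory:", peak.get("OnHeapExecutionMemory"))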

-----Original Message-----
From: Arthur Li  
Sent: Thursday, December 23, 2021 15:11
To: user@spark.apache.org
Subject: How to estimate the executor memory size according by the data

Dear experts,

Recently there have been some OOM issues in my demo jobs, which consume data
from the Hive database. I know I can increase the executor memory size to
eliminate the OOM error, but I don't know how to assess the executor memory
requirement or how to automatically adapt the executor memory size to the
data size.

Any advice is appreciated.
Arthur Li







How to estimate the executor memory size according by the data

2021-12-23 Thread Arthur Li
Dear experts,

Recently there have been some OOM issues in my demo jobs, which consume data
from the Hive database. I know I can increase the executor memory size to
eliminate the OOM error, but I don't know how to assess the executor memory
requirement or how to automatically adapt the executor memory size to the
data size.

Any advice is appreciated.
Arthur Li
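For reference, a hedged sketch of the configuration knobs usually involved is
shown below; the values are placeholders, not recommendations, and in practice
they are best tuned from the peak executor metrics observed in the WebUI/REST
API (see Luca's reply) rather than guessed from the input data size alone.

from pyspark.sql import SparkSession

# These settings take effect when the session (and its JVM) is created,
# e.g. via spark-submit --conf, not on an already-running session.
spark = (SparkSession.builder
         .appName("oom-demo")
         .config("spark.executor.memory", "4g")          # executor JVM heap
         .config("spark.executor.memoryOverhead", "1g")  # off-heap / native overhead
         .config("spark.memory.fraction", "0.6")         # heap share for execution + storage
         .getOrCreate())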




Re: measure running time

2021-12-23 Thread Gourav Sengupta
Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and Dataframe
can be compared based on their start and stop time may not provide any
valid information.

You will have to look into the details of timing and the steps. For
example, look at the Spark UI to see how timings are calculated in
distributed computing mode; there are several well-written papers on this.


Thanks and Regards,
Gourav Sengupta





On Thu, Dec 23, 2021 at 10:57 AM  wrote:

> hello community,
>
> In PySpark, how can I measure the running time of a command?
> I just want to compare the running time of the RDD API and the DataFrame
> API, as in this blog post of mine:
>
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>
> I tried spark.time() but it doesn't work.
> Thank you.
>
>
>


dataset partitioning algorithm implementation help

2021-12-23 Thread sam smith
Hello All,

I am replicating a paper's algorithm about a partitioning approach to
anonymize datasets with Spark / Java, and want to ask you for some help to
review my 150 lines of code. My GitHub repo, linked below, contains both
my Java class and the related paper:

https://github.com/SamSmithDevs10/paperReplicationForReview

Thanks in advance.


Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Try this simple thing first:

import time

def main():
    start_time = time.time()

    # The "Started at" / "Finished at" lines used uf.println(lst) in the
    # original, a helper from the author's own utility module; time.ctime()
    # is a standard-library stand-in so the snippet runs on its own.
    print("\nStarted at", time.ctime(start_time))

    # your code

    end_time = time.time()
    print("\nFinished at", time.ctime(end_time))
    time_elapsed = end_time - start_time
    print(f"""Elapsed time in seconds is {time_elapsed}""")
    spark_session.stop()  # assumes an existing SparkSession named spark_session


see

https://github.com/michTalebzadeh/spark_on_gke/blob/main/src/RandomDataBigQuery.py

HTH
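Two related notes. spark.time() belongs to the Scala SparkSession API and is
not exposed in PySpark, which is most likely why it did not work. Also, because
Spark evaluates lazily, the timed block must contain an action (count, collect,
write, ...) or you only measure the time to build the plan. A small sketch,
assuming rdd and df hold the same data loaded through the two APIs:

import time

def timed(label, fn):
    t0 = time.perf_counter()
    result = fn()                      # force the action inside the timed region
    print(f"{label}: {time.perf_counter() - t0:.3f}s (result={result})")

timed("RDD count", lambda: rdd.count())
timed("DataFrame count", lambda: df.count())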






*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 23 Dec 2021 at 10:58,  wrote:

> hello community,
>
> In PySpark, how can I measure the running time of a command?
> I just want to compare the running time of the RDD API and the DataFrame
> API, as in this blog post of mine:
>
> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>
> I tried spark.time() but it doesn't work.
> Thank you.
>
>
>


measure running time

2021-12-23 Thread bitfox

hello community,

In PySpark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the DataFrame
API, as in this blog post of mine:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.
