Thanks a lot Hollis. It was indeed due to the PyPI version; I have now updated it.

$ pip3 -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

$ pip3 install sparkmeasure
Collecting sparkmeasure
Using cached https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl
Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0
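
(To double-check that the package landed in the same Python that pyspark uses, something like this should do -- the printed path should sit under a python3 site-packages or dist-packages directory:)

$ python3 -c "import sparkmeasure; print(sparkmeasure.__file__)"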

$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
...
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+
...
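
By the way, if I read the sparkMeasure README correctly, you can also wrap the code explicitly instead of passing it in as a string -- a rough sketch:

stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()
stagemetrics.end()
stagemetrics.print_report()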


Hope this helps others who hit the same issue.
Happy holidays. :)

Bitfox


On 2021-12-25 09:48, Hollis wrote:
---- Replied mail ----

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: 12/25/2021 00:25
To: Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>, Luca Canali <luca.can...@cern.ch>
Subject: Re: measure running time

Hi Sean,

I have already discussed the issue I hit with Spark 3.1.1 and
sparkmeasure with the author, Luca Canali. It has been reproduced,
so I think we ought to wait for a patch.

HTH,

Mich

   view my Linkedin profile [1]

Disclaimer: Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.

On Fri, 24 Dec 2021 at 14:51, Sean Owen <sro...@gmail.com> wrote:

You probably did not install it on your cluster, or did not include
the Python package with your app.
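
For a cluster deployment, one common way to do that (just a sketch -- sparkmeasure.zip and my_app.py here are made-up names) is to pip install the package on every node, or ship the Python package with --py-files alongside the JVM package:

$ pip3 install sparkmeasure        # on every node, or:
$ spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 \
      --py-files sparkmeasure.zip my_app.py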

On Fri, Dec 24, 2021, 4:35 AM <bit...@bitfox.top> wrote:

But I already installed it:

Requirement already satisfied: sparkmeasure in
/usr/local/lib/python2.7/dist-packages

So what should I do? Thank you.
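
(The path in that message is Python 2.7's dist-packages, so it looks like this pip installs for Python 2 while pyspark runs Python 3; a quick way to compare the two:)

$ pip -V      # which interpreter this pip installs for
$ pip3 -V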

On 2021-12-24 18:15, Hollis wrote:
Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in
pyspark.

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
Hello list,

I am running Spark 3.2.0.

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't import the sparkmeasure module:

from sparkmeasure import StageMetrics
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.
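
(As far as I understand, --packages only fetches the Scala/JVM jar from Maven; the Python half of sparkmeasure still has to be installed for whichever interpreter pyspark runs. A quick way to see which interpreter that is, from inside the pyspark shell:)

import sys
print(sys.executable)  # the sparkmeasure wheel needs to be installed for this Python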


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try with the tools you provide
in
the
Github.

On 2021-12-23 23:40, Luca Canali wrote:
Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure
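
For the REST API mentioned above, a minimal example of pulling stage-level timings (assuming the driver UI is on the default port 4040; <app-id> is whatever the first call returns):

$ curl http://localhost:4040/api/v1/applications
$ curl http://localhost:4040/api/v1/applications/<app-id>/stages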

Best,

Luca

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user <user@spark.apache.org>
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the Spark UI to see how timings are calculated
in distributed computing mode; there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:

hello community,

In pyspark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the
DataFrame API, as in this blog post:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.
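
(For what it's worth, spark.time() is a method on the Scala SparkSession and, as far as I know, is not exposed in PySpark; a plain wall-clock measurement in Python would be something like:)

import time

t0 = time.time()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
print("elapsed: %.2f s" % (time.time() - t0))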








Links:
------
[1] https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
