Thanks a lot Hollis. It was indeed due to the PyPI version; I have now updated it.

$ pip3 -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

$ pip3 install sparkmeasure
Collecting sparkmeasure
Using cached https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl
Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0
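
(To double-check that the package landed in the same Python that pyspark uses, something like this should do -- the printed path should sit under a python3 site-packages or dist-packages directory:)

$ python3 -c "import sparkmeasure; print(sparkmeasure.__file__)"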

$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
...
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+
...
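
By the way, if I read the sparkMeasure README correctly, you can also wrap the code explicitly instead of passing it in as a string -- a rough sketch:

stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()
stagemetrics.end()
stagemetrics.print_report()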


Hope this helps others who hit the same issue.
Happy holidays. :)

Bitfox


On 2021-12-25 09:48, Hollis wrote:
---- Replied mail ----

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: 12/25/2021 00:25
To: Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>, Luca Canali <luca.can...@cern.ch>
Subject: Re: measure running time

Hi Sean,

I have already discussed the issue I hit with Spark 3.1.1 and
sparkmeasure with the author, Luca Canali. It has been reproduced,
so I think we ought to wait for a patch.

HTH,

Mich

   view my Linkedin profile [1]

Disclaimer: Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.

On Fri, 24 Dec 2021 at 14:51, Sean Owen <sro...@gmail.com> wrote:

You probably did not install it on your cluster, or did not include
the Python package with your app.
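
For a cluster deployment, one common way to do that (just a sketch -- sparkmeasure.zip and my_app.py here are made-up names) is to pip install the package on every node, or ship the Python package with --py-files alongside the JVM package:

$ pip3 install sparkmeasure        # on every node, or:
$ spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 \
      --py-files sparkmeasure.zip my_app.py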

On Fri, Dec 24, 2021, 4:35 AM <bit...@bitfox.top> wrote:

But I already installed it:

Requirement already satisfied: sparkmeasure in
/usr/local/lib/python2.7/dist-packages

So what should I do? Thank you.
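
(The path in that message is Python 2.7's dist-packages, so it looks like this pip installs for Python 2 while pyspark runs Python 3; a quick way to compare the two:)

$ pip -V      # which interpreter this pip installs for
$ pip3 -V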

On 2021-12-24 18:15, Hollis wrote:
Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in
pyspark.

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
Hello list,

I am running Spark 3.2.0.

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't import the sparkmeasure module:

from sparkmeasure import StageMetrics
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.
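
(As far as I understand, --packages only fetches the Scala/JVM jar from Maven; the Python half of sparkmeasure still has to be installed for whichever interpreter pyspark runs. A quick way to see which interpreter that is, from inside the pyspark shell:)

import sys
print(sys.executable)  # the sparkmeasure wheel needs to be installed for this Python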


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try with the tools you provide
in
the
Github.

On 2021-12-23 23:40, Luca Canali wrote:
Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure
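
For the REST API mentioned above, a minimal example of pulling stage-level timings (assuming the driver UI is on the default port 4040; <app-id> is whatever the first call returns):

$ curl http://localhost:4040/api/v1/applications
$ curl http://localhost:4040/api/v1/applications/<app-id>/stages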

Best,

Luca

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user <user@spark.apache.org>
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the Spark UI to see how timings are calculated
in distributed computing mode; there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:

hello community,

In pyspark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the
DataFrame API, as in this blog post:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() but it doesn't work.
Thank you.
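
(For what it's worth, spark.time() is a method on the Scala SparkSession and, as far as I know, is not exposed in PySpark; a plain wall-clock measurement in Python would be something like:)

import time

t0 = time.time()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
print("elapsed: %.2f s" % (time.time() - t0))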








Links:
------
[1] https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
