Cluster-mode job compute-time/cost metrics

2023-12-11 Thread Jack Wells
Hello Spark experts - I’m running Spark jobs in cluster mode using a
dedicated cluster for each job. Is there a way to see how much compute time
each job takes via Spark APIs, metrics, etc.? In case it makes a
difference, I’m using AWS EMR. I’d ultimately like to be able to say that
this job cost $X because it took Y minutes on Z instance types (assuming all
of the nodes are the same instance type), though I figure I’d probably need
to get the Z instance type through the EMR APIs.

Thanks!
Jack


Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Assuming you’re not writing to HDFS in your code, Spark can still spill to HDFS
if it runs out of memory on a per-executor basis. This could happen when
evaluating a cache operation like the one you have below, or during shuffle
operations in joins, etc. You might try increasing executor memory, tuning the
shuffle, avoiding caching, or reducing the size of your dataframe(s).
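
As a minimal, illustrative sketch of the configuration side of that (the values are placeholders, not recommendations, and would need tuning for your instance types and data volume):

```
from pyspark.sql import SparkSession

# Illustrative values only; tune to the cluster's instance types and data volume.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "16g")          # more heap per executor
    .config("spark.executor.memoryOverhead", "4g")   # extra off-heap headroom
    .config("spark.sql.shuffle.partitions", "600")   # right-size shuffle output
    .getOrCreate()
)

# And skip .cache() unless the dataframe is reused several times downstream.
```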

Jack

On Sep 8, 2023 at 12:43:07, Nebi Aydin wrote:

>
> Sure
> df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths())
> (
>     df
>     .where(SOME FILTER)
>     .repartition(6)
>     .cache()
> )
>
> On Fri, Sep 8, 2023 at 14:56 Jack Wells  wrote:
>
>> Hi Nebi, can you share the code you’re using to read and write from S3?
>>
>> On Sep 8, 2023 at 10:59:59, Nebi Aydin wrote:
>>
>>> Hi all,
>>> I am using Spark on EMR to process data. Basically, I read data from AWS
>>> S3, do the transformation, and after the transformation I load/write the
>>> data to S3.
>>>
>>> Recently we have found that HDFS (/mnt/hdfs) utilization is getting too
>>> high.
>>>
>>> I disabled `yarn.log-aggregation-enable` by setting it to False.
>>>
>>> I am not writing any data to HDFS (/mnt/hdfs); however, it seems that
>>> Spark is creating blocks and writing data into it. We are doing all the
>>> operations in memory.
>>>
>>> Is there any specific operation that writes data to the datanode (HDFS)?
>>>
>>> Here are the HDFS dirs created:
>>>
>>> ```
>>> 15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
>>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
>>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
>>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
>>> 129G   /mnt/hdfs/current
>>> 129G   /mnt/hdfs
>>> ```
>>>
>>


Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
Hi Ruben,

I’m not sure if this answers your question, but if you’re interested in
exploring the underlying tables, you could always try something like the
below in a Databricks notebook:

display(spark.read.table('samples.nyctaxi.trips'))

(For vanilla Spark users, it would be
spark.read.table('samples.nyctaxi.trips').show(100, False))

Since you’re using Databricks, you can also find the data under the Data
menu, scroll down to the samples metastore then click through to trips to
find the file location, schema, and sample data.
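
If the goal is to pull the table names out of the SQL text itself rather than work with the ANTLR grammar directly, one possible sketch (untested, and assuming a third-party parser such as sqlglot is acceptable) would be:

```
import sqlglot
from sqlglot import exp

sql = """
SELECT concat(pickup_zip, '-', dropoff_zip) as route, AVG(fare_amount) as average_fare
FROM `samples`.`nyctaxi`.`trips`
GROUP BY 1 ORDER BY 2 DESC
LIMIT 1000
"""

# Walk the parsed expression tree and collect every table reference; the Spark
# dialect is used so that backtick-quoted identifiers parse correctly.
for table in sqlglot.parse_one(sql, read="spark").find_all(exp.Table):
    print(".".join(part for part in (table.catalog, table.db, table.name) if part))
# For the example query above this should print: samples.nyctaxi.trips
```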

On Jun 29, 2023 at 23:53:25, Ruben Mennes  wrote:

> Dear Apache Spark community,
>
> I hope this email finds you well. My name is Ruben, and I am an
> enthusiastic user of Apache Spark, specifically through the Databricks
> platform. I am reaching out to you today to seek your assistance and
> guidance regarding a specific use case.
>
> I have been exploring the capabilities of Spark SQL and Databricks, and I
> have encountered a challenge related to accessing the data objects used by
> queries from the query history. I am aware that Databricks provides a
> comprehensive query history that contains valuable information about
> executed queries.
>
> However, my objective is to extract the underlying data objects (tables)
> involved in each query. By doing so, I aim to analyze and understand the
> dependencies between queries and the data they operate on. This information
> will give us new insights into how data is used across our data platform.
>
> I have attempted to leverage the Spark SQL Antlr grammar, available at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4,
> to parse the queries retrieved from the query history. Unfortunately, I
> have encountered difficulties when parsing more complex queries.
>
> As an example, I have struggled to parse queries with intricate constructs
> such as the following:
>
> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000
>
> I would greatly appreciate it if you could provide me with some guidance
> on how to overcome these challenges. Specifically, I am interested in
> understanding if there are alternative approaches or existing tools that
> can help me achieve my goal of extracting the data objects used by queries
> from the Databricks query history.
>
> Additionally, if there are any resources, documentation, or examples that
> provide further clarity on this topic, I would be more than grateful to
> receive them. Any insights you can provide would be of immense help in
> advancing my understanding and enabling me to make the most of the Spark
> SQL and Databricks ecosystem.
>
> Thank you very much for your time and support. I eagerly look forward to
> hearing from you and benefiting from your expertise.
>
> Best regards,
> Ruben Mennes
>