Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
The error message shows a mismatch between the configured warehouse
directory and the actual location accessible by the Spark application
running in the container.

You have configured the SparkSession with
spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where
to store temporary and intermediate data during operations like saving
DataFrames as tables. When running the application remotely, the container
cannot access the directory /data/hive/warehouse on your local machine. This
directory path may exist on the container's host system, but not within
the container itself.
You can set spark.sql.warehouse.dir to a directory within the container's
file system. This directory should be accessible by the Spark application
running inside the container. For example:

# Set spark.sql.warehouse.dir to any path that is suitable within the container
spark = SparkSession.builder \
    .appName("testme") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()

Use spark.conf.get("spark.sql.warehouse.dir") to print the configured
warehouse directory after creating the SparkSession and confirm all is OK.
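
For example, a minimal check using the two settings from the builder above:

# Verify the effective configuration once the SparkSession exists
print(spark.conf.get("spark.sql.warehouse.dir"))
print(spark.conf.get("hive.metastore.uris"))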

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Fri, 8 Mar 2024 at 06:01, Tom Barber  wrote:

> Okay interesting, maybe my assumption was incorrect, although I'm still
> confused.
>
> I tried to mount a central mount point that would be the same on my local
> machine and the container. Same error, although I moved the path to
> /tmp/hive/data/hive/; when I rerun the test code to save a table,
> the complaint still shows:
>
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> ERROR FileOutputCommitter: Mkdirs failed to create
> file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
>
> So what is /data/hive even referring to, when neither of the Spark conf
> values I print out now refers to /data/hive/?
>
> On Thu, Mar 7, 2024 at 9:49 PM Tom Barber  wrote:
>
>> Wonder if anyone can just sort my brain out here as to what's possible or
>> not.
>>
>> I have a container running Spark, with Hive and a ThriftServer. I want to
>> run code against it remotely.
>>
>> If I take something simple like this
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>>
>> # Initialize SparkSession
>> spark = SparkSession.builder \
>>     .appName("ShowDatabases") \
>>     .master("spark://192.168.1.245:7077") \
>>     .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>>     .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
>>     .enableHiveSupport() \
>>     .getOrCreate()
>>
>> # Define schema of the DataFrame
>> schema = StructType([
>>     StructField("id", IntegerType(), True),
>>     StructField("name", StringType(), True)
>> ])
>>
>> # Data to be converted into a DataFrame
>> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>>
>> # Create DataFrame
>> df = spark.createDataFrame(data, schema)
>>
>> # Show the DataFrame (optional, for verification)
>> df.show()
>>
>> # Save the DataFrame to a table named "my_table"
>> df.write.mode("overwrite").saveAsTable("my_table")
>>
>> # Stop the SparkSession
>> spark.stop()
>>
>> When I run it in the container it runs fine, but when I run it remotely
>> it says:
>>
>> : java.io.FileNotFoundException: File
>> file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>> at
>> 

Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-08 Thread sharad mishra
Hi Team,
We're encountering an issue with the Spark UI.
When reverse proxy is enabled in the master and worker configOptions, we are
not able to access the different tabs available in the Spark UI (e.g. stages,
environment, storage, etc.).

We're deploying Spark through the Bitnami Helm chart:
https://github.com/bitnami/charts/tree/main/bitnami/spark

Name and Version

bitnami/spark - 6.0.0

What steps will reproduce the bug?

Kubernetes Version: 1.25
Spark: 3.4.2
Helm chart: 6.0.0

Steps to reproduce:
After installing the chart, the Spark cluster (master and worker) UI is
available at:

https://spark.staging.abc.com/

We are able to access the running application by clicking on the
application ID under the Running Applications link.

We can access the Spark UI by clicking Application Detail UI, which takes
us to the Jobs tab.

The URL looks like:
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

When we click any of the tabs in the Spark UI, e.g. stages or environment,
it takes us back to the Spark cluster UI page. We noticed that the endpoint
changes to

https://spark.staging.abc.com/stages/

instead of

https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

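As a quick diagnostic (a minimal sketch in Python; the host and application
ID are the ones quoted above), both tab URLs can be requested and compared
by status code and final URL after any redirects:

```
import urllib.error
import urllib.request

# Check both the proxied and unproxied stages URLs and print where each
# request ends up (status code and final URL after redirects).
for url in [
    "https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/",
    "https://spark.staging.abc.com/stages/",
]:
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            print(url, "->", resp.status, resp.geturl())
    except urllib.error.HTTPError as err:
        print(url, "->", err.code)
```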


Are you using any custom parameters or values?

Configurations set in values.yaml
```
master:
  configOptions:
    -Dspark.ui.reverseProxy=true
    -Dspark.ui.reverseProxyUrl=https://spark.staging.abc.com

worker:
  configOptions:
    -Dspark.ui.reverseProxy=true
    -Dspark.ui.reverseProxyUrl=https://spark.staging.abc.com

service:
  type: ClusterIP
  ports:
    http: 8080
    https: 443
    cluster: 7077

ingress:
  enabled: true
  pathType: ImplementationSpecific
  apiVersion: ""
  hostname: spark.staging.abc.com
  ingressClassName: "staging"
  path: /
```
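
Assuming the ingress forwards to the master UI, the standalone master also
serves a JSON summary at /json, which can help confirm the master is
reachable through the proxy and list running application IDs for building
the /proxy/<appId>/ URLs. A sketch (field names may vary by Spark version):

```
import json
import urllib.request

# List active applications from the standalone master's /json endpoint.
with urllib.request.urlopen("https://spark.staging.abc.com/json") as resp:
    state = json.load(resp)

for app in state.get("activeapps", []):
    print(app["id"], app["name"])  # e.g. app-20240208103209-0030
```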



What is the expected behavior?

Expected behaviour is that when I click on the stages tab, instead of taking
me to
https://spark.staging.abc.com/stages/
it should take me to the following URL:
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

What do you see instead?

The current behaviour is that it takes me to
https://spark.staging.abc.com/stages/ , which shows the Spark cluster UI
with master and worker details.

Best,
Sharad