The error message shows a mismatch between the configured warehouse
directory and the location that is actually accessible to the Spark
application running in the container.

You have configured the SparkSession with
spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where
to store temporary and intermediate data during operations such as saving
DataFrames as tables. When the application is run remotely, the container
cannot access the directory /data/hive/warehouse on your local machine. That
path may exist on the container's host system, but not within the container
itself.
You can set spark.sql.warehouse.dir to a directory within the container's
file system, one that is accessible to the Spark application running inside
the container. For example:

# Set spark.sql.warehouse.dir to any suitable path within the container
spark = SparkSession.builder \
    .appName("testme") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()

After creating the SparkSession, use spark.conf.get("spark.sql.warehouse.dir")
to print the configured warehouse directory and confirm all is OK.
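A minimal sketch of that sanity check (the /tmp/spark-warehouse value is just
the example path used above):

print(spark.conf.get("spark.sql.warehouse.dir"))  # expect the container path, e.g. /tmp/spark-warehouse
print(spark.conf.get("hive.metastore.uris"))      # expect thrift://192.168.1.245:9083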

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


View my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Fri, 8 Mar 2024 at 06:01, Tom Barber <t...@spicule.co.uk> wrote:

> Okay interesting, maybe my assumption was incorrect, although I'm still
> confused.
>
> I tried to mount a central mount point that would be the same on my local
> machine and the container. Same error although I moved the path to
> /tmp/hive/data/hive/.... but when I rerun the test code to save a table,
> the complaint is still for
>
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> ERROR FileOutputCommitter: Mkdirs failed to create
> file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
>
> so what is /data/hive even referring to, when I print out the spark conf
> values and neither now refers to /data/hive/?
>
> On Thu, Mar 7, 2024 at 9:49 PM Tom Barber <t...@spicule.co.uk> wrote:
>
>> Wonder if anyone can just sort my brain out here as to whats possible or
>> not.
>>
>> I have a container running Spark, with Hive and a ThriftServer. I want to
>> run code against it remotely.
>>
>> If I take something simple like this
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, IntegerType,
>> StringType
>>
>> # Initialize SparkSession
>> spark = SparkSession.builder \
>>     .appName("ShowDatabases") \
>>     .master("spark://192.168.1.245:7077") \
>>     .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>>     .config("hive.metastore.uris","thrift://192.168.1.245:9083")\
>>     .enableHiveSupport() \
>>     .getOrCreate()
>>
>> # Define schema of the DataFrame
>> schema = StructType([
>>     StructField("id", IntegerType(), True),
>>     StructField("name", StringType(), True)
>> ])
>>
>> # Data to be converted into a DataFrame
>> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>>
>> # Create DataFrame
>> df = spark.createDataFrame(data, schema)
>>
>> # Show the DataFrame (optional, for verification)
>> df.show()
>>
>> # Save the DataFrame to a table named "my_table"
>> df.write.mode("overwrite").saveAsTable("my_table")
>>
>> # Stop the SparkSession
>> spark.stop()
>>
>> When I run it in the container it runs fine, but when I run it remotely
>> it says:
>>
>> : java.io.FileNotFoundException: File
>> file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
>> at
>> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>> at
>> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>>
>> My assumption is that it's trying to look on my local machine for
>> /data/hive/warehouse and failing, because on the remote box I can see those
>> folders.
>>
>> So the question is, if you're not backing it with hadoop or something do
>> you have to mount the drive in the same place on the computer running the
>> pyspark? Or am I missing a config option somewhere?
>>
>> Thanks!
>>
>
