Re: Creating remote tables using PySpark
Okay, that was some caching issue. Now there is a shared mount point between the place the PySpark code is executed and the Spark nodes it runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough!

On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote:
> [snip]
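[Editor's note: a minimal sketch of the alternative to an identical mount on every machine — point the warehouse at storage every node can reach over the network. The hdfs:// namenode address below is an assumption (an s3a:// bucket would work the same way); everything else matches the script quoted in the original message.]

```python
from pyspark.sql import SparkSession

# The hdfs:// host and path are assumptions -- substitute your own cluster's.
# The point: spark.sql.warehouse.dir should be a URI that the driver and
# every executor resolve to the same storage; a file: path only achieves
# that if it is mounted identically on every machine.
builder = (
    SparkSession.builder
    .appName("ShowDatabases")
    .master("spark://192.168.1.245:7077")
    .config("spark.sql.warehouse.dir", "hdfs://192.168.1.245:9000/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083")
    .enableHiveSupport()
)
# spark = builder.getOrCreate()  # only succeeds with a reachable cluster
```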
Re: Creating remote tables using PySpark
Okay, interesting. Maybe my assumption was incorrect, although I'm still confused.

I tried to mount a central mount point that would be the same on my local machine and in the container. Same error, although I moved the path to /tmp/hive/data/hive/. When I rerun the test code to save a table, the output and complaint are still:

    Warehouse Dir: file:/tmp/hive/data/hive/warehouse
    Metastore URIs: thrift://192.168.1.245:9083
    Warehouse Dir: file:/tmp/hive/data/hive/warehouse
    Metastore URIs: thrift://192.168.1.245:9083
    ERROR FileOutputCommitter: Mkdirs failed to create file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0

So what is /data/hive even referring to? When I print out the Spark conf values, neither of them now refers to /data/hive/.

On Thu, Mar 7, 2024 at 9:49 PM Tom Barber wrote:
> [snip]
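[Editor's note: the conf printout above can come from a small helper along these lines — a sketch, and `show_warehouse_conf` is a made-up name. It only reports what the live session resolved; the metastore can still hand back an older location it recorded when a database or table was first created, which is one possible way a stale path like /data/hive resurfaces.]

```python
def show_warehouse_conf(spark):
    """Print the warehouse/metastore settings the live session resolved.

    Note: existing databases and tables keep the location stored in the
    metastore at creation time, so this may differ from where writes go.
    """
    lines = []
    for key in ("spark.sql.warehouse.dir", "hive.metastore.uris"):
        # RuntimeConfig.get accepts a default for unset keys
        lines.append(f"{key} = {spark.conf.get(key, '<unset>')}")
    print("\n".join(lines))
    return lines
```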
Creating remote tables using PySpark
Wonder if anyone can just sort my brain out here as to what's possible or not.

I have a container running Spark, with Hive and a ThriftServer. I want to run code against it remotely.

If I take something simple like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("ShowDatabases") \
        .master("spark://192.168.1.245:7077") \
        .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
        .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
        .enableHiveSupport() \
        .getOrCreate()

    # Define the schema of the DataFrame
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True)
    ])

    # Data to be converted into a DataFrame
    data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]

    # Create the DataFrame
    df = spark.createDataFrame(data, schema)

    # Show the DataFrame (optional, for verification)
    df.show()

    # Save the DataFrame to a table named "my_table"
    df.write.mode("overwrite").saveAsTable("my_table")

    # Stop the SparkSession
    spark.stop()

When I run it in the container it runs fine, but when I run it remotely it says:

    java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
        at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)

My assumption is that it's trying to look on my local machine for /data/hive/warehouse and failing, because on the remote box those folders do exist.

So the question is: if you're not backing it with Hadoop or something, do you have to mount the drive in the same place on the computer running the PySpark code? Or am I missing a config option somewhere?

Thanks!
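[Editor's note: one way to see which path the cluster will actually write to is to read back the location the metastore has recorded for the table. A sketch — `table_location` is a hypothetical helper; it assumes the DESCRIBE FORMATTED output layout Spark SQL uses, where each row carries col_name/data_type/comment columns.]

```python
def table_location(spark, table):
    """Return the storage location the Hive metastore recorded for `table`.

    Parses DESCRIBE FORMATTED output: the location appears as a row whose
    first column is "Location" and whose second column is the URI.
    Returns None if no Location row is found.
    """
    for row in spark.sql(f"DESCRIBE FORMATTED {table}").collect():
        if (row[0] or "").strip() == "Location":
            return (row[1] or "").strip()
    return None
```

If the returned URI still says file:/data/hive/... for an existing table, the write goes there regardless of what spark.sql.warehouse.dir now says, since that setting only applies when new objects are created.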