Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay, that was a caching issue. It works now that there is a shared mount point
between the machine where the PySpark code is executed and the Spark nodes it
runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough!
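For anyone else who hits this: with a file:/ warehouse path, the driver and
every executor each resolve that path against their own local filesystem, so it
has to exist (via an identical mount) on all of them. One way to avoid the
shared mount entirely should be to point the warehouse at storage every node can
reach over the network, for example an S3-compatible bucket via s3a. A rough
sketch only (the bucket name and credentials are placeholders, and the cluster
needs the hadoop-aws jars available):

from pyspark.sql import SparkSession

# Keep the same metastore, but put table data on shared object storage so the
# driver and the executors all resolve the same paths.
spark = SparkSession.builder \
    .appName("SharedWarehouseSketch") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "s3a://my-warehouse-bucket/warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# Databases created from here on get their location under the s3a warehouse
# path, so no node needs a local /data/hive mount.
spark.sql("CREATE DATABASE IF NOT EXISTS shared_db")

df = spark.createDataFrame([(1, "John Doe"), (2, "Jane Doe")], ["id", "name"])
df.write.mode("overwrite").saveAsTable("shared_db.my_table")

spark.stop()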

On Thu, Mar 7, 2024 at 11:23 PM Tom Barber  wrote:

> Okay interesting, maybe my assumption was incorrect, although I'm still
> confused.
>
> I tried to mount a central mount point that would be the same on my local
> machine and in the container. I get the same error even though I moved the
> path to /tmp/hive/data/hive/; when I rerun the test code to save a table,
> the complaint is still about:
>
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> ERROR FileOutputCommitter: Mkdirs failed to create file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
>
> So what is /data/hive even referring to, when I print out the Spark conf
> values and neither of them now refers to /data/hive/?
>
> On Thu, Mar 7, 2024 at 9:49 PM Tom Barber  wrote:
>
>> Wonder if anyone can just sort my brain out here as to what's possible or
>> not.
>>
>> I have a container running Spark, with Hive and a ThriftServer. I want to
>> run code against it remotely.
>>
>> If I take something simple like this:
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>>
>> # Initialize SparkSession
>> spark = SparkSession.builder \
>>     .appName("ShowDatabases") \
>>     .master("spark://192.168.1.245:7077") \
>>     .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>>     .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
>>     .enableHiveSupport() \
>>     .getOrCreate()
>>
>> # Define schema of the DataFrame
>> schema = StructType([
>>     StructField("id", IntegerType(), True),
>>     StructField("name", StringType(), True)
>> ])
>>
>> # Data to be converted into a DataFrame
>> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>>
>> # Create DataFrame
>> df = spark.createDataFrame(data, schema)
>>
>> # Show the DataFrame (optional, for verification)
>> df.show()
>>
>> # Save the DataFrame to a table named "my_table"
>> df.write.mode("overwrite").saveAsTable("my_table")
>>
>> # Stop the SparkSession
>> spark.stop()
>>
>> When I run it in the container it runs fine, but when I run it remotely
>> it says:
>>
>> : java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>> at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
>> at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>> at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>>
>> My assumption is that it's trying to look on my local machine for
>> /data/hive/warehouse and failing, because on the remote box I can see that
>> those folders do exist.
>>
>> So the question is: if you're not backing it with Hadoop or something
>> similar, do you have to mount the drive in the same place on the computer
>> running the PySpark code? Or am I missing a config option somewhere?
>>
>> Thanks!
>>
>


Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay interesting, maybe my assumption was incorrect, although I'm still
confused.

I tried to mount a central mount point that would be the same on my local
machine and in the container. I get the same error even though I moved the
path to /tmp/hive/data/hive/; when I rerun the test code to save a table,
the complaint is still about:

Warehouse Dir: file:/tmp/hive/data/hive/warehouse
Metastore URIs: thrift://192.168.1.245:9083
Warehouse Dir: file:/tmp/hive/data/hive/warehouse
Metastore URIs: thrift://192.168.1.245:9083
ERROR FileOutputCommitter: Mkdirs failed to create file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0

So what is /data/hive even referring to, when I print out the Spark conf
values and neither of them now refers to /data/hive/?
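Most likely /data/hive is coming out of the metastore itself rather than out of
the session conf: the location of a database (and of any existing table) is
recorded in the Hive metastore when it is created, so changing
spark.sql.warehouse.dir on the client does not move databases that already
exist. A quick check, as a sketch run against the same SparkSession as the test
code, assuming the database in the error is the "input" from "input.db":

# Ask the metastore where it thinks the target database lives; the location it
# reports is where saveAsTable will try to write, regardless of the session's
# spark.sql.warehouse.dir.
spark.sql("DESCRIBE DATABASE EXTENDED input").show(truncate=False)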

On Thu, Mar 7, 2024 at 9:49 PM Tom Barber  wrote:

> Wonder if anyone can just sort my brain out here as to what's possible or
> not.
>
> I have a container running Spark, with Hive and a ThriftServer. I want to
> run code against it remotely.
>
> If I take something simple like this:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>
> # Initialize SparkSession
> spark = SparkSession.builder \
>     .appName("ShowDatabases") \
>     .master("spark://192.168.1.245:7077") \
>     .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>     .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
>     .enableHiveSupport() \
>     .getOrCreate()
>
> # Define schema of the DataFrame
> schema = StructType([
>     StructField("id", IntegerType(), True),
>     StructField("name", StringType(), True)
> ])
>
> # Data to be converted into a DataFrame
> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>
> # Create DataFrame
> df = spark.createDataFrame(data, schema)
>
> # Show the DataFrame (optional, for verification)
> df.show()
>
> # Save the DataFrame to a table named "my_table"
> df.write.mode("overwrite").saveAsTable("my_table")
>
> # Stop the SparkSession
> spark.stop()
>
> When I run it in the container it runs fine, but when I run it remotely it
> says:
>
> : java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
> at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>
> My assumption is that it's trying to look on my local machine for
> /data/hive/warehouse and failing, because on the remote box I can see that
> those folders do exist.
>
> So the question is: if you're not backing it with Hadoop or something
> similar, do you have to mount the drive in the same place on the computer
> running the PySpark code? Or am I missing a config option somewhere?
>
> Thanks!
>


Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Wonder if anyone can just sort my brain out here as to what's possible or
not.

I have a container running Spark, with Hive and a ThriftServer. I want to
run code against it remotely.

If I take something simple like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ShowDatabases") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# Define schema of the DataFrame
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Data to be converted into a DataFrame
data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Show the DataFrame (optional, for verification)
df.show()

# Save the DataFrame to a table named "my_table"
df.write.mode("overwrite").saveAsTable("my_table")

# Stop the SparkSession
spark.stop()
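As a side note, a quick way to sanity-check what the session actually picked up
(a sketch to run right after getOrCreate(), using only the session built above):

# Sanity check: print the settings the session resolved and confirm the
# metastore connection by listing the databases it knows about.
print("Warehouse Dir:", spark.conf.get("spark.sql.warehouse.dir"))
print("Metastore URIs:", spark.conf.get("hive.metastore.uris", "<not set>"))

for db in spark.catalog.listDatabases():
    print(db.name, db.locationUri)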

When I run it in the container it runs fine, but when I run it remotely it
says:

: java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)

My assumption is that it's trying to look on my local machine for
/data/hive/warehouse and failing, because on the remote box I can see that
those folders do exist.
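One way to test that assumption, as a small sketch using the same session as
above: check whether the warehouse path exists from the driver's point of view
and from the executors' point of view.

import os

# Does the path exist on the machine running this PySpark driver?
print("driver sees it:", os.path.exists("/data/hive/warehouse"))

# Does it exist on the executors running on the Spark nodes?
seen_by_executors = spark.sparkContext \
    .parallelize(range(4), 4) \
    .map(lambda _: os.path.exists("/data/hive/warehouse")) \
    .collect()
print("executors see it:", seen_by_executors)

If the driver comes back False while the executors come back True, that matches
the exception above: the tasks do their work on the nodes, but the job-commit
step runs in the driver process and resolves the file: path against the
driver's local filesystem.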

So the question is: if you're not backing it with Hadoop or something similar,
do you have to mount the drive in the same place on the computer running the
PySpark code? Or am I missing a config option somewhere?

Thanks!