Hi,
I am trying to play around with Spark and Spark SQL.
I have logs being stored in HDFS in 10 minute windows. Each 10 minute
window can contain as many as 10 randomly named Parquet files of about 2GB
each.
Now I want to run some analysis on these files, so I am trying to run
Spark SQL queries on them.
I notice that the API can only take a single Parquet file, not a directory
or a glob pattern from which all the files could be loaded as a single
SchemaRDD.
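(To be concrete, this is the call I mean; the path below is just a
placeholder, and I am on a 1.x build so method names may differ slightly
by version.)

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Works for one file, but I see no way to pass a directory or a glob:
val window = sqlContext.parquetFile("hdfs:///logs/2014-07-01-1200/part-abc.parquet")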

I tried doing a unionAll, but from the output of the job it looked like it
was merging the files and writing them to disk (not confirmed, but I am
assuming so from the time it took).
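(Roughly what I tried, with placeholder paths:)

val paths = Seq(
  "hdfs:///logs/2014-07-01-1200/part-abc.parquet",
  "hdfs:///logs/2014-07-01-1200/part-def.parquet"
  // ... up to ~10 files per window
)
// Load one SchemaRDD per file and fold them together with unionAll.
val combined = paths.map(p => sqlContext.parquetFile(p)).reduce(_ unionAll _)
combined.registerTempTable("logs")
sqlContext.sql("SELECT count(*) FROM logs").collect()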

I also tried insertInto, but that definitely wrote to disk, and the times
were comparable to the unionAll run.
Is there a way to run jobs on multiple files as if they were a single RDD?
I am not restricted to Spark SQL; it is just what I started playing around
with.
What has stopped us from creating an API that takes a glob pattern and
creates a single RDD from all of the files it matches?
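(For illustration, this is roughly the kind of thing I mean, expanding the
glob myself with the Hadoop FileSystem API; paths are made up.)

import org.apache.hadoop.fs.{FileSystem, Path}

// Expand the glob to concrete file paths, then union the per-file SchemaRDDs.
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.globStatus(new Path("hdfs:///logs/2014-07-01-1200/*.parquet"))
  .map(_.getPath.toString)
val logs = files.map(p => sqlContext.parquetFile(p)).reduce(_ unionAll _)
logs.registerTempTable("logs")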

Thanks
mohnish
