Hi, I am trying to play around with Spark and Spark SQL. I have logs being stored in HDFS in 10-minute windows, and each window can contain as many as 10 Parquet files, with random names, of about 2 GB each. I want to run some analysis on these files, so I am trying to run Spark SQL queries against them. I notice that the API seems to take only a single Parquet file, not a directory or a glob pattern from which all the files could be loaded as a single SchemaRDD.
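To make the question concrete, here is a rough sketch of what I would like to do. The directory name is made up, and I am not sure whether parquetFile will actually accept a directory of part files on my version, which is part of what I am asking:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WindowQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WindowQuery"))
    val sqlContext = new SQLContext(sc)

    // One directory per 10-minute window (path is hypothetical).
    val windowDir = "hdfs:///logs/window-1200"

    // parquetFile() is documented as taking a single path; I would like to
    // point it at the whole window directory, or a glob, instead.
    val logs = sqlContext.parquetFile(windowDir)
    logs.registerTempTable("logs")   // registerAsTable on older releases

    sqlContext.sql("SELECT count(*) FROM logs").collect().foreach(println)
  }
}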
I tried doing a unionAll, but from the output of the job it looked like it was merging the files and writing to disk (not confirmed, but that is what I assume from the time it took). I tried insertInto, but that definitely wrote to disk, and the times were comparable to the unionAll run. Is there a way to run jobs over multiple files as if they were a single RDD?

I am not restricted to Spark SQL; it is just what I started playing around with. What has stopped us from creating an API that takes a glob pattern and creates a single RDD from all of the files it matches?

Thanks,
mohnish
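P.S. For reference, this is roughly the union-based approach I tried, run from spark-shell (so sc already exists); the file names are made up. My understanding is that unionAll is a lazy transformation, so I would not have expected it to rewrite anything on its own:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical part files for one 10-minute window.
val files = Seq(
  "hdfs:///logs/window-1200/part-aaa.parquet",
  "hdfs:///logs/window-1200/part-bbb.parquet")

// Load each file as a SchemaRDD and union them; if unionAll is lazy, the
// cost should only show up when a query actually runs.
val combined = files.map(sqlContext.parquetFile).reduce(_ unionAll _)
combined.registerTempTable("logs")

sqlContext.sql("SELECT count(*) FROM logs").collect().foreach(println)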