One option is to use Spark SQL with HiveContext to insert into a table. That's worked well for me, but you still need to periodically run REFRESH on the table in Impala so it sees the new data.

________________________________
From: rafeeq s <rafeeq.ec...@gmail.com>
Sent: 8/24/2014 4:20 AM
To: u...@spark.incubator.apache.org
Subject: How to make Spark Streaming write its output so that Impala can read it?
I have the following problem with the Spark Streaming API. I am currently streaming input data via Kafka to Spark Streaming, where I plan to do some preprocessing on the data. Then I'd like to save the data as Parquet files and query it with Impala.

However, Spark writes the data files to separate directories, and a new directory is generated for every RDD. This is a problem because, first of all, external tables in Impala cannot detect subdirectories, only files, inside the directory they point to, unless the table is partitioned. Secondly, Spark adds new directories so quickly that creating a new partition in Impala for every generated directory would be very bad for performance. On the other hand, if I increase the roll interval of the writes in Spark so that directories are generated less frequently, there will be an added delay before Impala can read the incoming data. This is not acceptable, since my system has to support real-time applications.

In Hive, I could configure the external tables to also detect subdirectories without the need for partitioning, by using these settings:

    set hive.mapred.supports.subdirectories=true;
    set mapred.input.dir.recursive=true;

But to my understanding, Impala does not have a feature like this.

* Is there any method to make the external tables in Impala detect subdirectories?
* If not, is there any method to make Spark write its output files into a single directory, or otherwise in a form that is instantly readable by Impala?

Regards,
Rafeeq S
(“What you do is what matters, not what you think or say or plan.”)
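
On the second question, one possible workaround is to let Spark keep writing one directory per batch into a staging area and then move the finished files into the single flat directory that the Impala external table points to. The sketch below is only an illustration under stated assumptions, not something taken from the thread. It uses the local filesystem as a stand-in for HDFS. The names `flatten_batch_dirs`, `staging_root`, `table_dir` and the `batch-0001` directory naming are hypothetical. It also assumes that a batch directory is complete once it contains a `_SUCCESS` marker file. As the reply above notes, Impala would still need a periodic refresh to see the moved files.

```python
import os
import shutil
import tempfile


def flatten_batch_dirs(staging_root, table_dir):
    """Move data files from finished per-batch subdirectories of
    staging_root into the single flat directory table_dir.

    Returns the list of destination paths that were created.
    """
    os.makedirs(table_dir, exist_ok=True)
    moved = []
    for batch in sorted(os.listdir(staging_root)):
        batch_path = os.path.join(staging_root, batch)
        if not os.path.isdir(batch_path):
            continue
        # Assumption: a batch is complete once it contains _SUCCESS.
        if not os.path.exists(os.path.join(batch_path, "_SUCCESS")):
            continue
        for name in sorted(os.listdir(batch_path)):
            # Skip marker and hidden files (for example _SUCCESS, .crc).
            if name.startswith(("_", ".")):
                continue
            # Prefix with the batch name so that part-00000 files from
            # different batches cannot collide in the flat directory.
            dest = os.path.join(table_dir, f"{batch}-{name}")
            # A rename within one filesystem does not copy the data.
            os.rename(os.path.join(batch_path, name), dest)
            moved.append(dest)
        # Remove the emptied batch directory and its marker files.
        shutil.rmtree(batch_path)
    return moved


if __name__ == "__main__":
    # Tiny local demonstration with empty stand-in files.
    root = tempfile.mkdtemp()
    staging = os.path.join(root, "staging")
    batch = os.path.join(staging, "batch-0001")
    os.makedirs(batch)
    for name in ("part-00000.parquet", "_SUCCESS"):
        open(os.path.join(batch, name), "w").close()
    print(flatten_batch_dirs(staging, os.path.join(root, "table")))
```

On a real cluster the same idea would use HDFS rename operations instead of `os.rename`. The trade-off is a table directory that accumulates many small files, so some periodic compaction would probably be needed as well.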