One option is to use Spark SQL with a HiveContext to insert each batch into a 
Hive table. That has worked well for me, but you still need to run REFRESH on 
the table in Impala periodically so that it sees the new data.
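
A rough sketch of the periodic refresh, in case it helps. It assumes 
impala-shell is on the PATH of the machine running it (its -i and -q options 
select the impalad and the query to run). The table name "events", the 
localhost:21000 address, and the 30-second interval are placeholders I made up, 
not anything from your setup.

```python
import subprocess
import time


def refresh_command(table, impalad="localhost:21000"):
    """Build an impala-shell invocation that runs REFRESH on one table."""
    # "localhost:21000" is an assumed impalad address; adjust for your cluster.
    return ["impala-shell", "-i", impalad, "-q", "REFRESH {}".format(table)]


class RefreshThrottle:
    """Allows a refresh at most once per `min_interval_s` seconds."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.last = None  # time of the last allowed refresh, None if never

    def due(self):
        now = self.clock()
        if self.last is None or now - self.last >= self.min_interval_s:
            self.last = now
            return True
        return False


def refresh_if_due(throttle, table, run=subprocess.check_call):
    """Run REFRESH only when the throttle allows it; returns True if it ran."""
    if throttle.due():
        run(refresh_command(table))
        return True
    return False


# Hypothetical usage: call this after each batch has been inserted.
# throttle = RefreshThrottle(min_interval_s=30)
# refresh_if_due(throttle, "events")
```

Throttling keeps the number of REFRESH calls bounded even when batches arrive 
every second or so; the interval is a trade-off between metadata load on 
Impala and how quickly new rows become visible.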
________________________________
From: rafeeq s<mailto:rafeeq.ec...@gmail.com>
Sent: 8/24/2014 4:20 AM
To: u...@spark.incubator.apache.org<mailto:u...@spark.incubator.apache.org>
Subject: How to make Spark Streaming write its output so that Impala can read 
it?


I have the following problem with the Spark Streaming API. I am currently 
streaming input data via Kafka to Spark Streaming, where I plan to do some 
preprocessing on the data. I'd then like to save the data as Parquet files 
and query it with Impala.

However, Spark is writing the data files to separate directories and a new 
directory is generated for every RDD.

This is a problem because, first of all, the external tables in Impala cannot 
detect subdirectories, but only files, inside the directory they are pointing 
to, unless partitioned. Secondly, the new directories are added so fast by 
Spark that it would be very bad for performance to create a new partition 
periodically in Impala for every generated directory.

On the other hand, if I choose to increase the roll interval of the writes in 
Spark, so that the directories will be generated less frequently, there will be 
an added delay until Impala can read the incoming data. This is not acceptable 
since my system has to support real-time applications. In Hive, I could 
configure the external tables to also detect the subdirectories without need 
for partitioning, by using these settings:


set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;


But to my understanding, Impala does not have a feature like this.

  *   Is there any method to make the external tables in Impala detect 
sub-directories?
  *   If not, is there any method to make Spark write its output files into a 
single directory or otherwise in a form that is instantly readable by Impala?


Regards,

Rafeeq S
(“What you do is what matters, not what you think or say or plan.” )
