As for Impala, subdirectories are typically used for partitions, so
partitioning is one way to get it to read data in subdirectories:
http://grokbase.com/p/cloudera/impala-user/1387dvdzev/creating-impala-external-tables-from-partitioned-dir-file-structures

The catch is that you have to create those partitions yourself at some
point with DDL commands; it's not automatic (as far as I know).
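For example, something along these lines in impala-shell (the table
name, partition column and path here are made up, just to illustrate):

  ALTER TABLE events ADD PARTITION (batch='2014-08-24-0910')
    LOCATION '/data/events/batch=2014-08-24-0910';
  REFRESH events;

(I believe a REFRESH is also needed for Impala to notice new files that
land in an existing partition.)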

As for Spark Streaming, I suppose you could write a bit of extra code
in a foreachRDD call that does something with the files after they are
written. In fact, the 'save' functions are just using foreachRDD under
the hood. The files probably have to start out as many part files under
some subdirectory, since the save operation is distributed. But then I
believe you could move or merge the results into one place of your
choosing using the HDFS APIs.
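A rough, untested sketch of that idea in Scala (the DStream, staging
path and target directory are made up; it renames the part files into
a single directory rather than concatenating them, since Parquet parts
can't simply be appended to each other):

  import org.apache.hadoop.fs.{FileSystem, Path}

  stream.foreachRDD { (rdd, time) =>
    // write each batch to its own staging directory first, since the
    // distributed save produces many part files
    val staging = s"/tmp/staging/${time.milliseconds}"
    rdd.saveAsTextFile(staging)  // or saveAsParquetFile on a SchemaRDD

    // then move the part files into the one directory the Impala
    // external table points at, prefixed with the batch time so the
    // names don't collide
    val fs = FileSystem.get(rdd.context.hadoopConfiguration)
    fs.listStatus(new Path(staging))
      .filter(_.getPath.getName.startsWith("part-"))
      .foreach { st =>
        fs.rename(st.getPath,
          new Path(s"/data/events/${time.milliseconds}-${st.getPath.getName}"))
      }
    fs.delete(new Path(staging), true)
  }

Renames within HDFS are just metadata operations, so this shouldn't add
much latency per batch.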

On Sun, Aug 24, 2014 at 9:19 AM, rafeeq s <rafeeq.ec...@gmail.com> wrote:
> I have the following problem with the Spark Streaming API. I am currently
> streaming input data via Kafka to Spark Streaming, with which I plan to do
> some preprocessing of the data. Then, I'd like to save the data as Parquet
> files and query them with Impala.
>
> However, Spark writes the data files to separate directories, and a new
> directory is generated for every RDD.
>
> This is a problem because, first of all, external tables in Impala detect
> only files, not subdirectories, inside the directory they point to, unless
> the table is partitioned. Secondly, the new directories are added by Spark
> so quickly that it would be very bad for performance to create a new
> partition in Impala for every generated directory.
>
> On the other hand, if I increase the roll interval of the writes in Spark,
> so that directories are generated less frequently, there will be an added
> delay before Impala can read the incoming data. This is not acceptable,
> since my system has to support real-time applications. In Hive, I could
> configure the external tables to also detect the subdirectories without the
> need for partitioning, by using these settings:
>
> set hive.mapred.supports.subdirectories=true;
> set mapred.input.dir.recursive=true;
>
> But to my understanding, Impala does not have a feature like this.
>
> Is there any way to make the external tables in Impala detect
> subdirectories?
> If not, is there any way to make Spark write its output files into a
> single directory, or otherwise in a form that is instantly readable by
> Impala?
>
>
>
> Regards,
>
> Rafeeq S
> (“What you do is what matters, not what you think or say or plan.” )
>
