Hi,

To migrate data from *HBase* to *Parquet* we used the following query through
*Impala*:

INSERT INTO TABLE PARQUET_HASHTAGS (
  key, city_name, country_name, hashtag_date, hashtag_text,
  hashtag_source, hashtag_month, posted_time, hashtag_time,
  tweet_id, user_id, user_name, hashtag_year
) PARTITION (year, month, day)
SELECT
  key, city_name, country_name, hashtag_date, hashtag_text,
  hashtag_source, hashtag_month, posted_time, hashtag_time,
  tweet_id, user_id, user_name, hashtag_year,
  CAST(hashtag_year AS INT), CAST(hashtag_month AS INT), CAST(hashtag_date AS INT)
FROM HASHTAGS
WHERE hashtag_year = '2014' AND hashtag_month = '04' AND hashtag_date = '01'
ORDER BY key, hashtag_year, hashtag_month, hashtag_date
LIMIT 10000000 OFFSET 0;

Using the above query we have successfully migrated from HBase to Parquet files
with proper partitions.

Now we are storing data directly from *Kafka* to *Parquet*.

*How is it possible to create partitions while storing data directly from
Kafka to Parquet files (like the partitions created by the above query)?*
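
For context, something along the following lines is what we have in mind
(a rough sketch only; the topic name, message format, schema, and output
paths below are made-up placeholders, not our actual code):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder schema mirroring the partition columns used in the Impala query
case class Hashtag(key: String, hashtag_text: String, year: Int, month: Int, day: Int)

// Assumed CSV-like Kafka messages: key,hashtag_text,year,month,day
def parseHashtag(line: String): Hashtag = {
  val f = line.split(",")
  Hashtag(f(0), f(1), f(2).toInt, f(3).toInt, f(4).toInt)
}

val sparkConf = new SparkConf().setAppName("KafkaToParquet")
val ssc = new StreamingContext(sparkConf, Seconds(60))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.createSchemaRDD   // implicit RDD[Hashtag] -> SchemaRDD

val zkQuorum = "zk-host:2181"       // placeholder ZooKeeper quorum
val lines = KafkaUtils.createStream(ssc, zkQuorum, "parquet-writer",
  Map("hashtags" -> 1)).map(_._2)

lines.foreachRDD { rdd =>
  val hashtags = rdd.map(parseHashtag)
  // Write each (year, month, day) group into its own directory so the layout
  // follows the Hive/Impala partition style: .../year=YYYY/month=MM/day=DD/
  val days = hashtags.map(h => (h.year, h.month, h.day)).distinct().collect()
  for ((y, m, d) <- days) {
    val part = hashtags.filter(h => h.year == y && h.month == m && h.day == d)
    part.saveAsParquetFile(
      s"hdfs:///parquet_hashtags/year=$y/month=$m/day=$d/batch-${System.currentTimeMillis}")
  }
}

ssc.start()
ssc.awaitTermination()

The idea is that the output directory names carry the partition values
(year/month/day), similar to the partitions created by the Impala query above.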


On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1...@gmail.com
> wrote:

> 1. You can put multiple Kafka topics in the same Kafka input stream.
> See the example KafkaWordCount
> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
> However they will all be read through a single receiver (though with multiple
> threads, one per topic). To parallelize the read (for increasing
> throughput), you can create multiple Kafka input streams, and split the
> topics appropriately between them.
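>
> For instance, a minimal sketch of that (the topic names, group id, and
> number of streams are illustrative; ssc is the StreamingContext and
> zkQuorum the ZooKeeper quorum set up elsewhere):
>
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> // One receiver per input stream; each handles its own subset of topics
> val topicGroups = Seq(Map("topic1" -> 1), Map("topic2" -> 1))
> val streams = topicGroups.map(topics =>
>   KafkaUtils.createStream(ssc, zkQuorum, "spark-consumer", topics).map(_._2))
> val unified = ssc.union(streams)   // single DStream fed by all the receivers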
>
> 2. You can easily read and write Parquet files in Spark. Any RDD
> (generated through DStreams in Spark Streaming, or otherwise) can be
> converted to a SchemaRDD and then saved in the Parquet format using
> rdd.saveAsParquetFile. See the Spark SQL guide
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
> for more details. So if you want to write the same dataset (as RDDs) to two
> different Parquet files, you just have to call saveAsParquetFile twice (on
> the same or transformed versions of the RDD), as shown in the guide.
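>
> A minimal sketch of that (the case class, output paths, and sample RDD are
> placeholders; sc is the SparkContext):
>
> import org.apache.spark.sql.SQLContext
>
> case class Record(key: String, value: String)
>
> val sqlContext = new SQLContext(sc)
> import sqlContext.createSchemaRDD   // implicit RDD[Record] -> SchemaRDD
>
> // Stand-in for an RDD coming out of a DStream batch
> val rdd = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
> val records = rdd.map { case (k, v) => Record(k, v) }
>
> records.saveAsParquetFile("hdfs:///data/table_one")                           // first Parquet file
> records.filter(_.value.nonEmpty).saveAsParquetFile("hdfs:///data/table_two")  // transformed copy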
>
> Hope this helps!
>
> TD
>
>
> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Currently we are reading (multiple) topics from Apache Kafka and storing
>> them in HBase (multiple tables) using Twitter Storm (one tuple is stored in
>> 4 different tables).
>> But we are facing some performance issues with HBase,
>> so we are replacing *HBase* with *Parquet* files and *Storm* with *Apache
>> Spark*.
>>
>> Difficulties:
>> 1. How to read multiple topics from Kafka using Spark?
>> 2. One tuple belongs to multiple tables; how to write one topic to
>> multiple Parquet files with proper partitioning using Spark?
>>
>> Please help me.
>> Thanks in advance.
>>
>> --
>> *Regards,*
>>
>> *Mahebub *
>>
>
>


-- 
*Regards,*
*Mahebub Sayyed*
