Hi,

To migrate data from *HBase* to *Parquet* we used the following query through *Impala*:
INSERT INTO TABLE PARQUET_HASHTAGS (
  key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
  hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name,
  hashtag_year
) PARTITION (year, month, day)
SELECT
  key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
  hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name,
  hashtag_year,
  CAST(hashtag_year AS INT), CAST(hashtag_month AS INT), CAST(hashtag_date AS INT)
FROM HASHTAGS
WHERE hashtag_year = '2014' AND hashtag_month = '04' AND hashtag_date = '01'
ORDER BY key, hashtag_year, hashtag_month, hashtag_date
LIMIT 10000000 OFFSET 0;

Using the above query we successfully migrated from HBase to Parquet files with proper partitions. Now we are storing data directly from *Kafka* to *Parquet*.

*How can we create partitions while storing data directly from Kafka to Parquet files (like the ones created by the above query)?*

On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> 1. You can put multiple Kafka topics in the same Kafka input stream. See the
> example KafkaWordCount
> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
> However, they will all be read through a single receiver (though with
> multiple threads, one per topic). To parallelize the reads (for increased
> throughput), you can create multiple Kafka input streams and split the
> topics appropriately between them.
>
> 2. You can easily read and write Parquet files in Spark. Any RDD (generated
> through DStreams in Spark Streaming, or otherwise) can be converted to a
> SchemaRDD and then saved in the Parquet format with rdd.saveAsParquetFile.
> See the Spark SQL guide
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
> for more details. So if you want to write the same dataset (as RDDs) to two
> different Parquet files, you just have to call saveAsParquetFile twice (on
> the same or transformed versions of the RDD), as shown in the guide.
>
> Hope this helps!
>
> TD
>
>
> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>
>> Hi All,
>>
>> Currently we are reading (multiple) topics from Apache Kafka and storing
>> them in HBase (multiple tables) using Twitter Storm (one tuple is stored
>> in 4 different tables).
>> But we are facing some performance issues with HBase, so we are replacing
>> *HBase* with *Parquet* files and *Storm* with *Apache Spark*.
>>
>> Difficulties:
>> 1. How to read multiple topics from Kafka using Spark?
>> 2. One tuple belongs to multiple tables; how to write one topic to
>> multiple Parquet files with proper partitioning using Spark?
>>
>> Please help me.
>> Thanks in advance.
>>
>> --
>> *Regards,*
>>
>> *Mahebub*
>>
>
>

--
*Regards,*

*Mahebub Sayyed*
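
PS: For reference, below is a minimal sketch of one possible way to write partitioned Parquet files from a Kafka DStream with Spark Streaming, not a definitive implementation. It assumes Spark 1.0-era APIs (KafkaUtils.createStream, the createSchemaRDD implicit, SchemaRDD.saveAsParquetFile); the Hashtag case class, the parseHashtag parser, the ZooKeeper address, the topic name, and the HDFS path are hypothetical placeholders. Each micro-batch is split by (year, month, day) and written under Hive/Impala-style partition directories.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical record type; in practice it would carry all the HASHTAGS columns.
case class Hashtag(key: String, hashtagText: String,
                   year: String, month: String, day: String)

object KafkaToPartitionedParquet {

  // Hypothetical parser; adapt to the actual Kafka message format.
  def parseHashtag(line: String): Hashtag = {
    val f = line.split(",")
    Hashtag(f(0), f(1), f(2), f(3), f(4))
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KafkaToPartitionedParquet")
    val ssc  = new StreamingContext(conf, Seconds(60))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD   // implicit RDD[Hashtag] -> SchemaRDD

    // One receiver per createStream call; create several streams and union
    // them to parallelize reads across topics/receivers.
    val messages = KafkaUtils.createStream(
      ssc, "zkhost:2181", "parquet-writer", Map("hashtags" -> 1)).map(_._2)

    messages.foreachRDD { rdd =>
      val hashtags = rdd.map(parseHashtag).cache()

      // Find which (year, month, day) combinations are present in this batch.
      val days = hashtags.map(h => (h.year, h.month, h.day)).distinct().collect()

      // Write each day's slice under a Hive/Impala-style partition directory,
      // e.g. .../year=2014/month=04/day=01/batch-<timestamp>.
      days.foreach { case (y, m, d) =>
        val slice = hashtags.filter(h => h.year == y && h.month == m && h.day == d)
        slice.saveAsParquetFile(
          s"hdfs:///warehouse/parquet_hashtags/" +
          s"year=$y/month=$m/day=$d/batch-${System.currentTimeMillis}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that Impala would still need to be told about the new partition directories (for example via ALTER TABLE ... ADD PARTITION followed by REFRESH) before it can query them.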