Re: Apache kafka + spark + Parquet

2014-07-22 Thread buntu
Now we are storing data directly from Kafka to Parquet. We are currently using Camus and wanted to know how you went about storing to Parquet?

Apache kafka + spark + Parquet

2014-07-17 Thread Mahebub Sayyed
Hi all, we currently read multiple topics from Apache Kafka and store the data in HBase (multiple tables) using Twitter Storm (one tuple is stored in 4 different tables), but we are facing some performance issues with HBase. So we are replacing *HBase* with *Parquet* files and *Storm* with

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
1. You can put multiple Kafka topics in the same Kafka input stream. See the KafkaWordCount example: https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala. However they will all be read
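As a rough illustration of point 1: the linked KafkaWordCount example builds a map from topic name to consumer-thread count out of a comma-separated topic string and hands that map to `KafkaUtils.createStream`. The sketch below shows only that map-building step in plain Scala, so it runs without Spark on the classpath. The names `TopicMapSketch` and `buildTopicMap` and the topic names are hypothetical and chosen for illustration.

```scala
object TopicMapSketch {
  // Build Map(topic -> number of consumer threads) from a comma-separated list.
  // Blank entries and surrounding whitespace are dropped.
  def buildTopicMap(topics: String, numThreads: Int): Map[String, Int] =
    topics.split(",").map(_.trim).filter(_.nonEmpty).map(t => (t, numThreads)).toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical topic names; every topic ends up in the same input stream.
    val topicMap = buildTopicMap("tweets,hashtags,users", 2)
    println(topicMap)
    // With the Spark Streaming Kafka receiver available, this map would be passed
    // along the lines of: KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
  }
}
```

Because all topics share one stream, records from different topics arrive interleaved, so any per-topic handling has to happen in a later transformation.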

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Mahebub Sayyed
Hi, to migrate data from *HBase* to *Parquet* we used the following query through *Impala*: INSERT INTO table PARQUET_HASHTAGS( key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year )

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post
val transformedStream = kafkaStream.map ... // whatever transformation you want to do
transformedStream.foreachRDD((rdd: RDD[...], time: Time) => { // save the rdd to parquet file, using time as the file
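The snippet above is cut off, but it suggests deriving each batch's output location from the batch time. A minimal sketch of that naming step in plain Scala follows, assuming the batch time is available as epoch milliseconds (Spark Streaming's `Time` exposes this as `milliseconds`). The names `BatchPathSketch` and `batchPath`, the `batch-` prefix, the `.parquet` suffix, and the base directory are hypothetical and are not taken from the original message.

```scala
object BatchPathSketch {
  // Derive one output location per batch from the batch time in milliseconds.
  // A trailing slash on the base directory is tolerated.
  def batchPath(baseDir: String, batchTimeMs: Long): String = {
    val base = if (baseDir.endsWith("/")) baseDir.dropRight(1) else baseDir
    s"$base/batch-$batchTimeMs.parquet"
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical base directory and batch time.
    println(batchPath("/data/hashtags", 1405641600000L))
    // prints /data/hashtags/batch-1405641600000.parquet
  }
}
```

Inside `foreachRDD`, a path built this way would be handed to whichever Parquet writer is in use. Writing each batch to its own location avoids one batch overwriting another.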

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Michael Armbrust
We don't have support for partitioned Parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406
On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post