TD, while looking at the API reference (version 1.1.0) for SchemaRDD I did find these two methods:

    def insertInto(tableName: String): Unit
    def insertInto(tableName: String, overwrite: Boolean): Unit

Wouldn't these be a nicer way of appending RDDs to a table, or are they not recommended as of now? Also, will this apply to a table created using the "registerTempTable" method?
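For concreteness, a minimal sketch of how I imagine these methods being used (this assumes Spark 1.1 with a HiveContext and an already-existing Hive table named "logs"; the table name and the LogLine case class are made up for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // Case classes used for schema inference should be top-level.
    case class LogLine(ts: Long, msg: String)

    object InsertIntoSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "insert-into-sketch")
        val hiveContext = new HiveContext(sc)
        import hiveContext._ // createSchemaRDD implicit for RDD[case class]

        // Assumes a Hive table "logs" with a compatible schema already exists.
        val newBatch = sc.parallelize(Seq(LogLine(1L, "started"), LogLine(2L, "stopped")))

        newBatch.insertInto("logs")                      // append the rows
        // newBatch.insertInto("logs", overwrite = true) // or replace the contents
      }
    }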
On Thu, Dec 11, 2014 at 6:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
> First of all, how long do you want to keep doing this? The data is
> going to increase infinitely and without any bounds, so it's going to
> get too big for any cluster to handle. If all of that is within bounds,
> then try the following.
>
> - Maintain a global variable holding the current RDD that stores all
> the log data. We are going to keep updating this variable.
> - Every batch interval, take the new data, union it with the earlier
> unified RDD (in the global variable), and update the global variable.
> If you want SQL queries on this data, then you will have to
> re-register this new RDD as the named table.
> - With this approach the number of partitions is going to increase
> rapidly, so periodically take the unified RDD and repartition it into
> a smaller set of partitions. This messes up the ordering of the data,
> but you may be fine with that if your queries are order-agnostic. Also,
> periodically checkpoint this RDD; otherwise the lineage is going to
> grow indefinitely and everything will start getting slower.
>
> Hope this helps.
>
> TD
>
> On Mon, Dec 8, 2014 at 6:29 PM, Xuelin Cao <xuelin...@yahoo.com.invalid>
> wrote:
> >
> > Hi,
> >
> > I'm wondering whether there is an efficient way to continuously
> > append new data to a registered Spark SQL table.
> >
> > This is what I want: I want to build an ad-hoc query service for a
> > JSON-formatted system log. Of course, the system log is continuously
> > generated. I will use Spark Streaming to connect to the system log as
> > my input, and I want to find a way to efficiently append the new data
> > to an existing Spark SQL table. Furthermore, I want the whole table to
> > be cached in memory/Tachyon.
> >
> > It looks like Spark SQL supports the "INSERT" method, but only for
> > Parquet files. In addition, it is inefficient to insert a single row
> > every time.
> >
> > I do know that somebody has built a system similar to what I want (an
> > ad-hoc query service over an ever-growing system log), so there must
> > be an efficient way. Does anyone know one?

--
Regards
Rakesh Nair
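For reference, a rough sketch of the union / re-register / repartition / checkpoint loop TD describes above (hedged: this assumes the Spark 1.1-era Streaming and SQL APIs; the socket source, batch interval, partition count, and all names are illustrative, not from this thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Top-level case class so reflection-based schema inference works.
    case class LogEntry(line: String)

    object UnifiedLogTable {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "unified-log-table")
        sc.setCheckpointDir("/tmp/checkpoints") // required before RDD.checkpoint()
        val ssc = new StreamingContext(sc, Seconds(10))

        val sqlContext = new SQLContext(sc)
        import sqlContext._ // createSchemaRDD implicit for RDD[case class]

        // The "global variable" holding the unified RDD of all data so far.
        var unified: RDD[String] = sc.parallelize(Seq.empty[String])
        var batches = 0L

        val logLines = ssc.socketTextStream("localhost", 9999) // stand-in source

        logLines.foreachRDD { newBatch =>
          val previous = unified
          // Union the new batch into the unified RDD and update the variable.
          unified = previous.union(newBatch)
          batches += 1

          // Periodically shrink the partition count and truncate the lineage.
          if (batches % 10 == 0) {
            unified = unified.repartition(16)
            unified.checkpoint()
          }

          unified.cache()
          unified.count() // action materializes the cache (and any checkpoint)
          previous.unpersist(blocking = false) // drop the superseded cached RDD

          // Re-register the updated RDD so SQL queries see the new data.
          unified.map(LogEntry).registerTempTable("all_logs")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }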