Nice, I’ll check it out. At first glance, writing Parquet files seems to be a 
bit complicated.
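
For reference, an untested sketch of what this looks like with Spark SQL 1.1 
(the WordCount case class and the paths are placeholders; sc is the usual 
SparkContext):

    import org.apache.spark.sql.SQLContext

    // Hypothetical schema: any case class of supported types works.
    case class WordCount(word: String, total: Long)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[WordCount] -> SchemaRDD

    val counts = sc.textFile("hdfs:///path/to/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .map { case (word, total) => WordCount(word, total) }

    // Writes a directory of Parquet part files.
    counts.saveAsParquetFile("hdfs:///path/to/word_counts.parquet")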

On 15.09.2014, at 13:54, andy petrella <andy.petre...@gmail.com> wrote:

> nope.
> It's an efficient storage format for genomics data :-D
> 
> aℕdy ℙetrella
> about.me/noootsab
> 
> 
> 
> On Mon, Sep 15, 2014 at 1:52 PM, Marius Soutier <mps....@gmail.com> wrote:
> So you are living the dream of using HDFS as a database? ;)
> 
> On 15.09.2014, at 13:50, andy petrella <andy.petre...@gmail.com> wrote:
> 
>> I'm using Parquet in ADAM, and I can say that it works pretty well!
>> Enjoy ;-)
>> 
>> aℕdy ℙetrella
>> about.me/noootsab
>> 
>> 
>> 
>> On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier <mps....@gmail.com> wrote:
>> Thank you guys, I’ll try Parquet, and if that’s not quick enough I’ll go the 
>> usual route with either a read-only or a normal database.
>> 
>> On 13.09.2014, at 12:45, andy petrella <andy.petre...@gmail.com> wrote:
>> 
>>> However, the cache is not guaranteed to remain: if other jobs launched in 
>>> the cluster need more memory than what's left in the overall caching memory, 
>>> previously cached RDDs will be evicted.
>>> 
>>> Using an off-heap cache like Tachyon as a dump repo can help.
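>>> 
>>> A quick untested sketch of the storage-level knobs (the counts RDD here is a 
>>> placeholder, and OFF_HEAP needs a Tachyon store configured):
>>> 
>>>     import org.apache.spark.storage.StorageLevel
>>> 
>>>     // Partitions that don't fit in memory are written to local disk
>>>     // instead of being recomputed when they are accessed again.
>>>     counts.persist(StorageLevel.MEMORY_AND_DISK)
>>> 
>>>     // Or keep the cached blocks off-heap in Tachyon (Spark 1.1):
>>>     // counts.persist(StorageLevel.OFF_HEAP)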
>>> 
>>> In general, I'd say that using a persistent sink (like Cassandra for 
>>> instance) is best.
>>> 
>>> my .2¢
>>> 
>>> 
>>> aℕdy ℙetrella
>>> about.me/noootsab
>>> 
>>> 
>>> 
>>> On Sat, Sep 13, 2014 at 9:20 AM, Mayur Rustagi <mayur.rust...@gmail.com> 
>>> wrote:
>>> You can cache data in memory & query it using Spark Job Server. 
>>> Most folks dump data down to a queue/db for retrieval. 
>>> You can also batch up data & store it into Parquet partitions, then query it 
>>> using another Spark SQL shell; the JDBC server in Spark SQL is part of 1.1, 
>>> I believe. 
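>>> 
>>> E.g., an untested sketch of querying those Parquet partitions from a separate 
>>> spark-shell (the path and table name are made up):
>>> 
>>>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>     val counts = sqlContext.parquetFile("hdfs:///path/to/word_counts.parquet")
>>>     counts.registerTempTable("word_counts")
>>>     sqlContext.cacheTable("word_counts")  // optional: keep it in memory
>>>     sqlContext.sql(
>>>       "SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10"
>>>     ).collect().foreach(println)
>>> 
>>> (For the JDBC route, Spark 1.1 ships a Thrift server under 
>>> sbin/start-thriftserver.sh.)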
>>> -- 
>>> Regards,
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi
>>> 
>>> 
>>> On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier <mps....@gmail.com> wrote:
>>> 
>>> Hi there, 
>>> 
>>> I’m pretty new to Spark, and so far I’ve written my jobs the same way I 
>>> wrote Scalding jobs: one-off jobs that read data from HDFS, count words, and 
>>> write the counts back to HDFS. 
>>> 
>>> Now I want to display these counts in a dashboard. Since Spark lets you 
>>> cache RDDs in memory and an app keeps running until you explicitly terminate 
>>> it (and there’s even a new JDBC server in 1.1), I’m assuming it’s possible 
>>> to keep an app running indefinitely and query an in-memory RDD from the 
>>> outside (via Spark SQL, for example). 
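>>> 
>>> Rough sketch of the pattern I have in mind (untested; assumes counts is the 
>>> word-count RDD of case classes from the job):
>>> 
>>>     import org.apache.spark.sql.SQLContext
>>> 
>>>     val sqlContext = new SQLContext(sc)
>>>     import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD
>>> 
>>>     counts.registerTempTable("word_counts")
>>>     sqlContext.cacheTable("word_counts")  // stays cached while the app lives
>>> 
>>>     // Anything running inside this driver (an embedded HTTP endpoint, a 
>>>     // Spark Job Server job, ...) could then serve the dashboard:
>>>     def top(n: Int) =
>>>       sqlContext.sql(
>>>         s"SELECT word, total FROM word_counts ORDER BY total DESC LIMIT $n"
>>>       ).collect()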
>>> 
>>> Is this how others are using Spark? Or are you just dumping job results 
>>> into message queues or databases? 
>>> 
>>> 
>>> Thanks 
>>> - Marius 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
