Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However, the problem remains: how do I get that
data back to a dashboard? So I guess I’ll have to use a database after all.
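
For reference, a rough sketch of that flow with the Spark 1.1 API; the paths, the
WordCount case class and the table name are made up for illustration:

    import org.apache.spark.SparkContext._
    import org.apache.spark.sql.SQLContext

    case class WordCount(word: String, cnt: Long)

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // write the job results as Parquet
    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .reduceByKey(_ + _)
      .map { case (w, c) => WordCount(w, c) }
    counts.saveAsParquetFile("hdfs:///data/word_counts.parquet")

    // read it back and query it with SparkSQL
    val parquetCounts = sqlContext.parquetFile("hdfs:///data/word_counts.parquet")
    parquetCounts.registerTempTable("word_counts")
    sqlContext.sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC LIMIT 10")
      .collect()
      .foreach(println)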


You can batch up data and store it into Parquet partitions as well, and query it
using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.


Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing AJAX/pull requests against, say, a REST API, you
can always create a Spark context in your REST service and use SparkSQL to
query over the Parquet files. The Parquet files are already on disk, so it
seems silly to write both to Parquet and to a DB... unless I'm missing
something in your setup.
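
A rough sketch of that setup, assuming the service embeds one long-lived
SparkContext; the JDK HttpServer here is only a stand-in for whatever REST
framework the dashboard actually talks to, and paths/ports are invented:

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object CountsService {
      def main(args: Array[String]): Unit = {
        // one long-lived context for the whole service
        val sc = new SparkContext(new SparkConf().setAppName("counts-service"))
        val sqlContext = new SQLContext(sc)

        // the Parquet files already written by the batch job
        sqlContext.parquetFile("hdfs:///data/word_counts.parquet")
          .registerTempTable("word_counts")

        val server = HttpServer.create(new InetSocketAddress(8080), 0)
        server.createContext("/top-words", new HttpHandler {
          def handle(exchange: HttpExchange): Unit = {
            // run the query on demand, straight off the Parquet files
            val rows = sqlContext
              .sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC LIMIT 10")
              .collect()
            val body = rows.map(_.mkString(",")).mkString("\n").getBytes("UTF-8")
            exchange.sendResponseHeaders(200, body.length)
            exchange.getResponseBody.write(body)
            exchange.close()
          }
        })
        server.start()
      }
    }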

On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier mps@gmail.com wrote:

 Writing to Parquet and querying the result via SparkSQL works great
 (except for some strange SQL parser errors). However, the problem remains:
 how do I get that data back to a dashboard? So I guess I’ll have to use a
 database after all.






Re: Serving data

2014-09-15 Thread Marius Soutier
Thank you guys, I’ll try Parquet, and if that’s not quick enough I’ll go the
usual route with either a read-only or a normal database.

On 13.09.2014, at 12:45, andy petrella andy.petre...@gmail.com wrote:

 However, the cache is not guaranteed to remain: if other jobs are launched in
 the cluster and require more memory than what's left in the overall caching
 memory, previous RDDs will be discarded.

 Using an off-heap cache like Tachyon as a dump repository can help.

 In general, I'd say that using a persistent sink (like Cassandra, for
 instance) is best.

 my .2¢
 
 
 



Re: Serving data

2014-09-15 Thread andy petrella
I'm using Parquet in ADAM, and I can say that it works pretty well!
Enjoy ;-)

aℕdy ℙetrella
about.me/noootsab

On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier mps@gmail.com wrote:

 Thank you guys, I’ll try Parquet, and if that’s not quick enough I’ll go
 the usual route with either a read-only or a normal database.



Re: Serving data

2014-09-15 Thread Marius Soutier
So you are living the dream of using HDFS as a database? ;)

On 15.09.2014, at 13:50, andy petrella andy.petre...@gmail.com wrote:

 I'm using Parquet in ADAM, and I can say that it works pretty well!
 Enjoy ;-)
 



Re: Serving data

2014-09-13 Thread Mayur Rustagi
You can cache data in memory and query it using the Spark Job Server.

Most folks dump data down to a queue/db for retrieval.

You can batch up data and store it into Parquet partitions as well, and query it
using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.
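
For what it's worth, a minimal sketch of the SparkSQL side of that, assuming the
Parquet partitions already exist (paths and table names are invented); the Spark
Job Server route is a separate project and not shown here:

    import org.apache.spark.sql.SQLContext

    // assumes an existing SparkContext `sc`, e.g. in another spark-shell
    val sqlContext = new SQLContext(sc)

    // point SparkSQL at the batched-up Parquet partitions
    sqlContext.parquetFile("hdfs:///data/word_counts.parquet")
      .registerTempTable("word_counts")

    // optionally pin the table in memory for repeated dashboard queries
    sqlContext.cacheTable("word_counts")

    sqlContext.sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC LIMIT 10")
      .collect()
      .foreach(println)
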
-- 
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote:

 Hi there,
 I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote
 Scalding jobs - one-off, read data from HDFS, count words, write counts back
 to HDFS.
 Now I want to display these counts in a dashboard. Since Spark allows you to
 cache RDDs in memory and you have to explicitly terminate your app (and
 there’s even a new JDBC server in 1.1), I’m assuming it’s possible to keep an
 app running indefinitely and query an in-memory RDD from the outside (via
 SparkSQL, for example).
 Is this how others are using Spark? Or are you just dumping job results into
 message queues or databases?
 Thanks
 - Marius

Re: Serving data

2014-09-13 Thread andy petrella
However, the cache is not guaranteed to remain: if other jobs are launched
in the cluster and require more memory than what's left in the overall
caching memory, previous RDDs will be discarded.

Using an off-heap cache like Tachyon as a dump repository can help.

In general, I'd say that using a persistent sink (like Cassandra, for
instance) is best.

my .2¢
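
A short sketch of those three options, assuming `counts` is an RDD[(String, Long)]
produced by the batch job; the Tachyon config key is from the 1.x docs, and the
Cassandra keyspace/table plus the DataStax spark-cassandra-connector dependency are
assumptions:

    import org.apache.spark.storage.StorageLevel
    import com.datastax.spark.connector._  // assumes the DataStax connector on the classpath

    // Option 1: plain in-memory cache - fast, but may be evicted when other jobs need the memory.
    counts.persist(StorageLevel.MEMORY_ONLY)

    // Option 2 (instead of option 1): off-heap cache backed by Tachyon, experimental in 1.x,
    // configured via spark.tachyonStore.url.
    // counts.persist(StorageLevel.OFF_HEAP)

    // Option 3: persistent sink - write the results to Cassandra and serve the dashboard from there.
    counts.saveToCassandra("dashboard", "word_counts", SomeColumns("word", "cnt"))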


aℕdy ℙetrella
about.me/noootsab

On Sat, Sep 13, 2014 at 9:20 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 You can cache data in memory and query it using the Spark Job Server.
 Most folks dump data down to a queue/db for retrieval.
 You can batch up data and store it into Parquet partitions as well, and query it
 using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.


Serving data

2014-09-12 Thread Marius Soutier
Hi there,

I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote 
Scalding jobs - one-off, read data from HDFS, count words, write counts back to 
HDFS.

Now I want to display these counts in a dashboard. Since Spark allows you to cache
RDDs in memory and you have to explicitly terminate your app (and there’s even
a new JDBC server in 1.1), I’m assuming it’s possible to keep an app running
indefinitely and query an in-memory RDD from the outside (via SparkSQL, for
example).

Is this how others are using Spark? Or are you just dumping job results into 
message queues or databases?


Thanks
- Marius
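
(For reference, the pattern described above looks roughly like this; the paths, the
WordCount case class and the table name are invented, and the serving side - an
embedded endpoint or the 1.1 JDBC server - is left out:)

    import org.apache.spark.SparkContext._

    // One-off, Scalding-style job: read, count, write, terminate.
    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/output")

    // Long-running variant: register the counts as an in-memory table and keep the app alive,
    // so queries can be answered from the cache instead of re-reading HDFS.
    case class WordCount(word: String, cnt: Long)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    counts.map { case (w, c) => WordCount(w, c) }.registerTempTable("word_counts")
    sqlContext.cacheTable("word_counts")
    // ...the driver now stays up and serves queries rather than calling sc.stop()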


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org