Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread David Morales
 fightf...@163.com
>> *CC:* user <user@spark.apache.org>; dev <d...@spark.apache.org>
>> *Subject:* Re: OLAP query using spark dataframe with cassandra
>>
>> Is there any distributor supporting these software components in
>> combination? If not, and your core business is not software, then you may
>> want to look for something else, because it might not make sense to build
>> up internal know-how in all of these areas.
>>
>> In any case, it all depends heavily on your data and queries. You will
>> have to run your own experiments.
>>
>> On 09 Nov 2015, at 07:02, "fightf...@163.com" <fightf...@163.com> wrote:
>>
>> Hi, community
>>
>> We are especially interested in this integration, following some slides
>> from [1]. The SMACK stack (Spark + Mesos + Akka + Cassandra + Kafka)
>> seems a good implementation of the lambda architecture in the open-source
>> world, especially for non-Hadoop cluster environments. As we see it, the
>> advantages are:
>>
>> 1. The feasibility and scalability of the Spark DataFrame API, which also
>> complements Apache Cassandra's native CQL features well.
>>
>> 2. Both streaming and batch processing are available within the one
>> stack.
>>
>> 3. Spark with Cassandra gives us both compactness and usability,
>> including seamless integration with job scheduling and resource
>> management.
>>
>> Our only concern is OLAP query performance, mainly because of frequent
>> aggregation over large tables that grow daily, for both Spark SQL and
>> Cassandra. I can see that the use case in [1] uses FiloDB to achieve
>> columnar storage and query performance, but we know nothing more about
>> it.
>>
>> Question: has anyone had such a use case, especially in a production
>> environment? We would be interested in your architecture for designing
>> this OLAP engine using Spark + Cassandra. How do you think this scenario
>> compares with a traditional OLAP cube design, such as Apache Kylin or
>> Pentaho Mondrian?
>>
>> Best Regards,
>>
>> Sun.
>>
>>
>> [1]
>> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>>
>> --
>> fightf...@163.com
>>
>>
>>
>
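
To make the cube question above concrete, here is a minimal Python sketch
contrasting on-demand aggregation over raw rows with a pre-aggregated
rollup. It is only a generic illustration of the trade-off, not how Kylin,
Mondrian, or FiloDB work internally; the `facts` rows, `scan_total`, and
`DailyCube` are hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (day, region, product, amount).
facts = [
    ("2015-11-08", "eu", "a", 10),
    ("2015-11-08", "eu", "b", 5),
    ("2015-11-09", "eu", "a", 7),
    ("2015-11-09", "us", "a", 3),
]


def scan_total(rows, region):
    """On-demand aggregation: every query rescans all raw rows."""
    return sum(amount for _, r, _, amount in rows if r == region)


class DailyCube:
    """Pre-aggregated totals keyed by (day, region), updated incrementally."""

    def __init__(self):
        self.totals = defaultdict(int)

    def load_day(self, rows):
        # Fold new rows into the rollup once, at load time.
        for day, region, _, amount in rows:
            self.totals[(day, region)] += amount

    def total(self, region):
        # Queries touch one cell per (day, region), not one per raw row.
        return sum(v for (_, r), v in self.totals.items() if r == region)


cube = DailyCube()
cube.load_day(facts)
print(scan_total(facts, "eu"), cube.total("eu"))
```

Both paths return the same answer; the difference is that the rollup pays
the aggregation cost once per load, at the price of fixing the dimensions
in advance.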


-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>


<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
<https://twitter.com/StratioBD>*


Re: Spark Streaming Suggestion

2015-09-15 Thread David Morales
Hi there,

This is exactly our goal in Stratio Sparkta, a real-time aggregation engine
fully developed with spark streaming (and fully open source).

Take a look at:


   - the docs: http://docs.stratio.com/modules/sparkta/development/
   - the repository: https://github.com/Stratio/sparkta
   - and some slides explaining how Sparkta was born and what it does:
   http://www.slideshare.net/Stratio/strata-sparkta


Feel free to ask us anything about the project.








2015-09-15 8:10 GMT+02:00 srungarapu vamsi <srungarapu1...@gmail.com>:

> The batch approach I implemented takes about 10 minutes to complete
> all the pre-computation tasks for one hour's worth of data. When I went
> through my code, I found that the most time-consuming tasks are the
> ones that read data from Cassandra and the places where I call
> sparkContext.union(Array[RDD]).
> Now the ask is to get the pre-computation tasks to near real time, so I am
> exploring the streaming approach.
>
> My pre-computation tasks include not only finding the unique numbers
> for a given device every minute, every hour, and every day, but also
> the following:
> 1. Find the number of unique numbers across a set of devices every minute,
> every hour, every day
> 2. Find the number of unique numbers that occur in common across a
> set of devices every minute, every hour, every day
> 3. Find (total time a number occurred across a set of devices)/(total
> unique numbers occurred across the set of devices)
> The pre-computation tasks above are just a few of what I will need, and
> there are many more coming my way :)
> All these problems seem to call for a data-parallel approach, and hence I
> am interested in doing this with Spark Streaming.
>
>
> On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> Why did you not stay with the batch approach? For me the architecture
>> looks very complex for a simple thing you want to achieve. Why don't you
>> process the data already in storm ?
>>
>> On Tue, 15 Sep 2015 at 6:20, srungarapu vamsi <srungarapu1...@gmail.com>
>> wrote:
>>
>>> I am pretty new to Spark. Please suggest a better model for the
>>> following use case.
>>>
>>> I have a few (about 1500) devices in the field which keep emitting about
>>> 100KB of data every minute. The data sent by the devices is just a list
>>> of numbers.
>>> As of now, we have Storm in the architecture, which receives this
>>> data, sanitizes it and writes it to Cassandra.
>>> Now I have a requirement to process this data. The processing includes
>>> finding unique numbers emitted by one or more devices for every minute,
>>> every hour, every day, every month.
>>> I had implemented this processing part as a batch job and now
>>> I am interested in making it a streaming application, i.e. calculating
>>> the processed data as and when devices emit the data.
>>>
>>> I have the following two approaches:
>>> 1. Storm writes the actual data to Cassandra and writes a message on the
>>> Kafka bus saying that data for device D and minute M has been written
>>> to Cassandra.
>>>
>>> Then Spark Streaming reads this message from Kafka, reads the data
>>> of device D at minute M from Cassandra, and starts processing the data.
>>>
>>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>>> actual data from Kafka, processes it, and writes to Cassandra.
>>> The second approach avoids the additional hit of reading from Cassandra
>>> every minute a device has written data to Cassandra, at the cost of
>>> putting the actual heavy messages instead of light events on Kafka.
>>>
>>> I am a bit confused between the two approaches. Please suggest which one
>>> is better and, if both are bad, how I can handle this use case.
>>>
>>>
>>> --
>>> /Vamsi
>>>
>>
>
>
> --
> /Vamsi
>
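
For reference, the pre-computation tasks quoted above can be sketched in
plain Python with sets. This is only a minimal illustration under stated
assumptions, not Sparkta's or Spark's API: the `readings` tuples and all
function names are hypothetical, and task 3 is read here as "total
occurrences divided by the count of unique numbers", which is one possible
interpretation of the quoted formula.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample readings: (device_id, timestamp, number).
readings = [
    ("d1", datetime(2015, 9, 15, 8, 0, 5), 7),
    ("d1", datetime(2015, 9, 15, 8, 0, 40), 9),
    ("d2", datetime(2015, 9, 15, 8, 0, 10), 7),
    ("d2", datetime(2015, 9, 15, 8, 1, 2), 3),
]


def window_key(ts, granularity):
    """Truncate a timestamp to the start of its minute, hour, or day."""
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown granularity: {granularity}")


def uniques_per_device(rows, granularity):
    """Map (window, device) to the set of numbers seen in that window."""
    out = defaultdict(set)
    for device, ts, number in rows:
        out[(window_key(ts, granularity), device)].add(number)
    return out


def across_devices(per_device, window, devices):
    """Tasks 1 and 2: numbers seen by any device, and by every device."""
    sets = [per_device.get((window, d), set()) for d in devices]
    union = set().union(*sets)
    common = set.intersection(*sets) if sets else set()
    return union, common


def occurrences_per_unique(rows, granularity, window, devices):
    """Task 3 (assumed reading): total occurrences / unique numbers."""
    in_window = [n for d, ts, n in rows
                 if d in devices and window_key(ts, granularity) == window]
    unique = set(in_window)
    return len(in_window) / len(unique) if unique else 0.0


hour = datetime(2015, 9, 15, 8)
per_device = uniques_per_device(readings, "hour")
union, common = across_devices(per_device, hour, ["d1", "d2"])
print(len(union), len(common))
```

The per-device sets are computed once per window and then combined, so the
cross-device tasks reuse the per-device result instead of rereading the
raw data.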


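The trade-off between the two approaches quoted above comes down to how
many bytes cross Kafka per minute. A rough back-of-the-envelope sketch,
assuming the 100KB figure is per device, 1 KB = 1000 bytes, and a
hypothetical 100-byte notification event for approach 1 (the event size is
not given in the thread):

```python
DEVICES = 1500
PAYLOAD_BYTES = 100 * 1000   # assumed: 100 KB per device per minute
EVENT_BYTES = 100            # hypothetical size of a "D, M written" event


def bytes_per_minute(message_bytes, devices=DEVICES):
    """Bytes put on Kafka per minute if each device yields one message."""
    return message_bytes * devices


heavy = bytes_per_minute(PAYLOAD_BYTES)  # approach 2: full payload on Kafka
light = bytes_per_minute(EVENT_BYTES)    # approach 1: notification only

print("approach 2:", heavy // 60, "bytes/s on Kafka")
print("approach 1:", light // 60, "bytes/s on Kafka,",
      "plus", heavy // 60, "bytes/s read back from Cassandra")
```

Under these assumptions approach 2 moves about 2.5 MB/s through Kafka,
while approach 1 moves the same volume as Cassandra reads instead, so the
choice is about where the load lands rather than how much there is.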



Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
We put a lot of work into Sparkta, and it is great to hear from both the
community and relevant people. It is as simple as that.

I hope you find time to consider the project, which is our main concern at
this moment, and I hope to hear from you too.



2015-05-14 17:46 GMT+02:00 Evo Eftimov evo.efti...@isecc.com:

 I do not intend to provide comments on the actual “product” since my time
 is engaged elsewhere



 My comments were on the “process” for commenting, which looked like
 self-indulgent, self-congratulatory communication (between members of
 the party and its party leader) – that bs used to be inherent to the
 “commercial” vendors, but I can confirm as fact that it also applies to the
 “open source movement” (because human nature remains the same)



 *From:* David Morales [mailto:dmora...@stratio.com]
 *Sent:* Thursday, May 14, 2015 4:30 PM
 *To:* Paolo Platter
 *Cc:* Evo Eftimov; Matei Zaharia; user@spark.apache.org

 *Subject:* Re: SPARKTA: a real-time aggregation engine based on Spark
 Streaming



 Thank you Paolo. Don't hesitate to contact us.



 Evo, we will be glad to hear from you, and we are certainly happy to get
 such fast feedback from Spark's main thought leader.







 2015-05-14 17:24 GMT+02:00 Paolo Platter paolo.plat...@agilelab.it:

 Nice Job!



 We are developing something very similar… I will contact you to see
 whether we can contribute something!



 Best



 Paolo



 *From:* Evo Eftimov evo.efti...@isecc.com
 *Sent:* Thursday, 14 May 2015 17:21
 *To:* 'David Morales' dmora...@stratio.com, Matei Zaharia
 matei.zaha...@gmail.com
 *Cc:* user@spark.apache.org



 That has been a really rapid “evaluation” of the “work” and its
 “direction”



 *From:* David Morales [mailto:dmora...@stratio.com]
 *Sent:* Thursday, May 14, 2015 4:12 PM
 *To:* Matei Zaharia
 *Cc:* user@spark.apache.org
 *Subject:* Re: SPARKTA: a real-time aggregation engine based on Spark
 Streaming



 Thanks for your kind words, Matei. We are happy to see that our work is on
 the right track.









 2015-05-14 17:10 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com:

 (Sorry, for non-English people: that means it's a good thing.)

 Matei


  On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
  ...This is madness!
 
  On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
 
  Hi there,
 
  We have released our real-time aggregation engine based on Spark
 Streaming.
 
  SPARKTA is fully open source (Apache2)
 
 
  You can check out the slides shown at Strata last week:
 
  http://www.slideshare.net/Stratio/strata-sparkta
 
  Source code:
 
  https://github.com/Stratio/sparkta
 
  And documentation
 
  http://docs.stratio.com/modules/sparkta/development/
 
 
  We are open to your ideas, and contributors are welcome.
 
 
  Regards.
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SPARKTA-a-real-time-aggregation-engine-based-on-Spark-Streaming-tp22883.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 







Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
Thank you Paolo. Don't hesitate to contact us.

Evo, we will be glad to hear from you, and we are certainly happy to get
such fast feedback from Spark's main thought leader.





Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
Thanks for your kind words, Matei. We are happy to see that our work is on
the right track.



