Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-27 Thread kant kodali
I understand now that I cannot use Spark Streaming window operations without
checkpointing to HDFS, as pointed out by @Ofir, but without window operations I
don't think we can do much with Spark Streaming. Since it is essential, can I
use Cassandra as the distributed storage? If so, can I see an example of how to
tell the Spark cluster to use Cassandra for checkpointing and the rest, if at
all possible?
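As far as I know, Spark Streaming's checkpoint directory has to live on a Hadoop-compatible filesystem, so Cassandra cannot back checkpointing directly; what you can do is checkpoint to HDFS (or S3) while keeping the data itself in Cassandra. A minimal sketch, assuming the DataStax spark-cassandra-connector and hypothetical hosts, keyspace and table:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._  // DataStax spark-cassandra-connector

    val conf = new SparkConf()
      .setAppName("cassandra-sketch")
      .set("spark.cassandra.connection.host", "10.0.0.1")  // hypothetical Cassandra node

    val ssc = new StreamingContext(conf, Seconds(10))
    // Checkpoints must go to a Hadoop-compatible filesystem (HDFS, S3, ...),
    // not to Cassandra itself.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")

    val counts = ssc.socketTextStream("source-host", 9999)  // hypothetical source
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))  // the window op that needs the checkpoint

    // The computed results, on the other hand, can live in Cassandra.
    counts.saveToCassandra("my_keyspace", "word_counts")  // hypothetical keyspace/table
    ssc.start()
    ssc.awaitTermination()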





On Fri, Aug 26, 2016 9:50 AM, Steve Loughran ste...@hortonworks.com wrote:

On 26 Aug 2016, at 12:58, kant kodali < kanth...@gmail.com > wrote:
@Steve, your arguments make sense; however, a good majority of people who have
extensive experience with ZooKeeper prefer to avoid it, and given the ease of
Consul (which, btw, uses Raft for leader election) and etcd, a lot of us are
more inclined to avoid ZK.
And yes, any technology needs time to mature, but that shouldn't stop us from
transitioning. For example, people started using Spark when it was first
released instead of waiting for Spark 2.0, which has a lot of optimizations
and bug fixes.


One way to look at the problem is: "what is the cost if something doesn't work?"
If it's some HA consensus system, one failure mode is "consensus failure,
everything goes into minority mode and offline": service lost, data fine.
Another is "partition with both groups thinking they are in charge", which is
more dangerous. Then there's "partitioning event not detected", which may be
bad.
So: consider the failure modes, and then consider not so much whether the tech
you are using is vulnerable to them, but "if it goes wrong, does it matter?"

Even before HDFS had HA with ZK/BookKeeper it didn't fail very often. And if you
looked at the causes of those failures, things like backbone switch failure are
so traumatic that things like ZK/etcd failures aren't going to make much of a
difference. The filesystem is down.
Generally, integrity gets priority over availability. That said, S3 and the like
have put availability ahead of consistency, and Cassandra can offer that too;
sometimes it is the right strategy.
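As a hedged illustration of that trade-off (setting names are from the DataStax spark-cassandra-connector and worth verifying against your connector version): the consistency level is tunable per job, so a deployment can deliberately favour availability:

    import org.apache.spark.SparkConf

    // Writing at consistency level ONE lets a write succeed while replicas are
    // down (availability first); QUORUM would favour consistency instead.
    val conf = new SparkConf()
      .setAppName("availability-over-consistency")
      .set("spark.cassandra.connection.host", "10.0.0.1")           // hypothetical node
      .set("spark.cassandra.output.consistency.level", "ONE")       // assumed setting name
      .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")  // assumed setting name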

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran

On 26 Aug 2016, at 12:58, kant kodali wrote:

@Steve, your arguments make sense; however, a good majority of people who have
extensive experience with ZooKeeper prefer to avoid it, and given the ease of
Consul (which, btw, uses Raft for leader election) and etcd, a lot of us are
more inclined to avoid ZK.

And yes, any technology needs time to mature, but that shouldn't stop us from
transitioning. For example, people started using Spark when it was first
released instead of waiting for Spark 2.0, which has a lot of optimizations
and bug fixes.



One way to look at the problem is: "what is the cost if something doesn't work?"

If it's some HA consensus system, one failure mode is "consensus failure,
everything goes into minority mode and offline": service lost, data fine.
Another is "partition with both groups thinking they are in charge", which is
more dangerous. Then there's "partitioning event not detected", which may be
bad.

So: consider the failure modes, and then consider not so much whether the tech
you are using is vulnerable to them, but "if it goes wrong, does it matter?"


Even before HDFS had HA with ZK/BookKeeper it didn't fail very often. And if
you looked at the causes of those failures, things like backbone switch failure
are so traumatic that things like ZK/etcd failures aren't going to make much of
a difference. The filesystem is down.

Generally, integrity gets priority over availability. That said, S3 and the
like have put availability ahead of consistency, and Cassandra can offer that
too; sometimes it is the right strategy.



Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali

@Mich, of course, and in my previous message I gave the context as well.
Needless to say, the tools used by many banks I came across, such as Citi,
Capital One, Wells Fargo and Goldman Sachs, are pretty laughable when it comes
to compliance and security. They somehow think they are secure when they aren't.





On Fri, Aug 26, 2016 5:46 AM, Mich Talebzadeh mich.talebza...@gmail.com wrote:
> And yes, any technology needs time to mature, but that shouldn't stop us
> from transitioning

That depends on the application and how mission-critical the business it is
deployed for. If you are using a tool for a bank's credit risk (surveillance,
anti-money laundering, employee compliance, anti-fraud, etc.) and the tool
missed a big chunk for whatever reason, then the first thing that will happen
is that the bank will be fined (in the millions of dollars) and I will be
looking for a new job in London transport.
On the other hand, if the tool is used for some social media, sentiment
analysis and all that sort of stuff, I don't think anyone is going to lose
sleep.
HTH








Dr Mich Talebzadeh



LinkedIn 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw




http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 26 August 2016 at 12:58, kant kodali < kanth...@gmail.com > wrote:
@Steve, your arguments make sense; however, a good majority of people who have
extensive experience with ZooKeeper prefer to avoid it, and given the ease of
Consul (which, btw, uses Raft for leader election) and etcd, a lot of us are
more inclined to avoid ZK.
And yes, any technology needs time to mature, but that shouldn't stop us from
transitioning. For example, people started using Spark when it was first
released instead of waiting for Spark 2.0, which has a lot of optimizations
and bug fixes.





On Fri, Aug 26, 2016 2:50 AM, Steve Loughran ste...@hortonworks.com wrote:

On 25 Aug 2016, at 22:49, kant kodali < kanth...@gmail.com > wrote:
Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797






I worry about any attempt to implement distributed consensus systems: they take
time in production to get right.
1. There's the need to prove that what you are building is valid: that the
specification is correct and the implementation matches it. That has apparently
been done for ZK, though given the complexity of the maths involved, I cannot
vouch for it myself:
https://blog.acolyer.org/2015/03/09/zab-high-performance-broadcast-for-primary-backup-systems/
2. You need to run it in production to find the problems. Google's Chubby paper
hints at the things they found went wrong there. As far as ZK goes, Jepsen
hints it's robust:
https://aphyr.com/posts/291-jepsen-zookeeper
If it has weaknesses, I'd point at:
 - its security model
 - its lack of helpfulness when there are Kerberos/SASL auth problems (ZK
server closes connection; client sees connection failure and retries)
 - the fact that its failure modes aren't always understood by people coding
against it.
http://blog.cloudera.com/blog/2014/03/zookeeper-resilience-at-pinterest/
The Raft algorithm appears to be easier to implement than Paxos; there are
things built on it, and I look forward to seeing what works/doesn't work in
production.
Certainly Aphyr found problems when he pointed Jepsen at etcd, though being a
2014 piece of work, I expect those specific problems to have been addressed.
The main thing is: it shows how hard it is to get things right in the presence
of complex failures.
Finally, regarding S3:
You can use the S3 object store as a source of data in queries/streaming and,
if done carefully, as a destination. Performance is variable; that is something
some of us are working on, across S3A, Spark and Hive.
Conference placement: I shall be talking on that topic at Spark Summit Europe
if you want to find out more: https://spark-summit.org/eu-2016/

On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean, very good points.
@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the devops people here at our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
ZooKeeper. I also heard that Mesos 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Carlile, Ken



We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. It works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3.


—Ken
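For anyone wondering what that looks like in code, a minimal sketch (mount path hypothetical): point Spark at file:// URIs, with the NFS mount visible at the same path on every worker:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("nfs-sketch"))

    // file:// paths resolve on each node's local filesystem, so an NFS mount
    // shared by all workers behaves like a plain POSIX distributed store.
    val records = sc.textFile("file:///mnt/nfs/data/input")  // hypothetical mount
    records.map(_.toUpperCase).saveAsTextFile("file:///mnt/nfs/data/output")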



On Aug 25, 2016, at 8:02 PM, kant kodali wrote:

The ZFS Linux port has become very stable these days, given that LLNL maintains
it and also uses it as the file system for their supercomputer (which, from
what I heard, is one of the top machines in the nation).

On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote:

How about using ZFS?

On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:


That's often not as important as you might think. It really only affects the loading of data by the first Stage. Subsequent Stages (in the same Job, or even in other Jobs if you do it right) will use the map outputs, and will do so with good data locality.
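A hedged sketch of that point (bucket and paths hypothetical): only the first stage pays the remote-read cost; stages after the shuffle consume map outputs already sitting on the workers:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("locality-sketch"))

    // Stage 1: the only stage that pays the "no data locality" cost, since the
    // input lives in S3 rather than on the workers.
    val events = sc.textFile("s3a://my-bucket/events")  // hypothetical bucket
      .map(line => (line.split(",")(0), 1L))

    // The shuffle writes map outputs to local disk on each worker; every stage
    // after this reads those local outputs with good locality.
    val counts = events.reduceByKey(_ + _).cache()

    counts.count()                     // job 1: reads S3 once, shuffles
    counts.filter(_._2 > 100).count()  // job 2: served from cached/shuffled data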

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha  wrote:

At the core of it, MapReduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use HDFS. S3 or NFS will not be able to provide that.


On 26 Aug 2016 07:49, "kant kodali" wrote:

Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.

https://issues.apache.org/jira/browse/MESOS-3797

On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:


Mesos also uses ZK for leader election.  There seems to be some effort in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806


On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:







@Ofir @Sean, very good points.

@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many things, but for our use case all we need is high availability, and given the frustrations of the devops people here at our company, who had extensive experience managing large clusters in the past, we would be very happy to avoid ZooKeeper. I also heard that Mesos can provide high availability through etcd and Consul, and if that is true I will be left with the following stack:

Spark + Mesos scheduler + a distributed file system (or, to be precise, distributed storage, since S3 is an object store), so I guess this will be HDFS for us, plus etcd & Consul. Now the big question for me is how do I set all this up.
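Not a full answer, but as a hedged starting point for the Spark-on-Mesos part (host names hypothetical; the rest is standard Spark configuration):

    import org.apache.spark.{SparkConf, SparkContext}

    // With a single Mesos master the URL is mesos://host:port; an HA Mesos
    // setup is usually mesos://zk://..., which is exactly the ZK dependency
    // being debated in this thread.
    val conf = new SparkConf()
      .setAppName("mesos-sketch")
      .setMaster("mesos://mesos-master.example.com:5050")  // hypothetical master
      .set("spark.executor.uri",                           // where executors fetch Spark
        "hdfs://namenode:8020/dist/spark-2.0.0-bin-hadoop2.7.tgz")

    val sc = new SparkContext(conf)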













On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:


Just to add one concrete example regarding the HDFS dependency.
Have a look at checkpointing: https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a cluster without checkpointing to HDFS (or S3).
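A minimal sketch of that constraint (source host hypothetical): the windowed operation below only works once a fault-tolerant checkpoint directory is set:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-sketch"), Seconds(5))

    // Without this line, the window operation below fails at runtime, because
    // Spark has nowhere fault-tolerant to persist the intermediate state.
    ssc.checkpoint("hdfs://namenode:8020/checkpoints/window-sketch")

    val lines = ssc.socketTextStream("source-host", 9999)  // hypothetical source
    lines.countByWindow(Seconds(30), Seconds(5)).print()   // the windowed operation

    ssc.start()
    ssc.awaitTermination()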






Ofir Manor

Co-Founder & CTO | Equalum


Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io





On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh  wrote:


Hi Kant,


I trust the following would be of use.


Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.

At the heart of it, and with reference to the points you raised about HDFS, one needs a working knowledge of the Hadoop core system, including HDFS, the MapReduce algorithm and YARN, whether one uses them or not. After all, Big Data is all about horizontal scaling with a master and nodes (as opposed to vertical scaling, like SQL Server running on a host) and distributed data (by default, data is replicated three times on different nodes for scalability and availability).

Other members, including Sean, provided the limits on how far one can operate Spark in its own space. If you are going to deal with data (data in motion and data at rest), then you will need to interact with some form of storage, and HDFS and compatible file systems like S3 are the natural choices.

ZooKeeper is not just about high availability. It is used in Spark Streaming with Kafka, it is used with Hive for concurrency, and it is also a distributed locking system.
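A hedged illustration of the Kafka point: the receiver-based Kafka API in Spark 1.x takes a ZooKeeper quorum directly (host names hypothetical), since consumer offsets are tracked in ZK:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-zk-sketch"), Seconds(5))

    // The receiver-based API tracks consumed offsets in ZooKeeper, which is
    // one of the places ZK shows up in a Spark Streaming + Kafka deployment.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk1:2181,zk2:2181,zk3:2181",  // hypothetical ZooKeeper quorum
      "sketch-consumer-group",       // consumer group id
      Map("events" -> 1))            // topic -> number of receiver threads

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()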


HTH











Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 













On 25 August 2016 at 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Mich Talebzadeh
> And yes, any technology needs time to mature, but that shouldn't stop us
> from transitioning

That depends on the application and how mission-critical the business it is
deployed for. If you are using a tool for a bank's credit risk (surveillance,
anti-money laundering, employee compliance, anti-fraud, etc.) and the tool
missed a big chunk for whatever reason, then the first thing that will happen
is that the bank will be fined (in the millions of dollars) and I will be
looking for a new job in London transport.

On the other hand, if the tool is used for some social media, sentiment
analysis and all that sort of stuff, I don't think anyone is going to lose
sleep.

HTH









Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 August 2016 at 12:58, kant kodali  wrote:

> @Steve, your arguments make sense; however, a good majority of people who
> have extensive experience with ZooKeeper prefer to avoid it, and given the
> ease of Consul (which, btw, uses Raft for leader election) and etcd, a lot
> of us are more inclined to avoid ZK.
>
> And yes, any technology needs time to mature, but that shouldn't stop us
> from transitioning. For example, people started using Spark when it was
> first released instead of waiting for Spark 2.0, which has a lot of
> optimizations and bug fixes.
>
>
>
> On Fri, Aug 26, 2016 2:50 AM, Steve Loughran ste...@hortonworks.com wrote:
>
>>
>> On 25 Aug 2016, at 22:49, kant kodali  wrote:
>>
>> Yeah, so it seems like it's a work in progress. At the very least, Mesos
>> took the initiative to provide alternatives to ZK. I am really looking
>> forward to this.
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>>
>> I worry about any attempt to implement distributed consensus systems:
>> they take time in production to get right.
>>
>> 1. There's the need to prove that what you are building is valid: that the
>> specification is correct and the implementation matches it. That has
>> apparently been done for ZK, though given the complexity of the maths
>> involved, I cannot vouch for it myself:
>> https://blog.acolyer.org/2015/03/09/zab-high-performance-broadcast-for-primary-backup-systems/
>>
>> 2. You need to run it in production to find the problems. Google's Chubby
>> paper hints at the things they found went wrong there. As far as ZK goes,
>> Jepsen hints it's robust:
>>
>> https://aphyr.com/posts/291-jepsen-zookeeper
>>
>> If it has weaknesses, I'd point at:
>>  - its security model
>>  - its lack of helpfulness when there are Kerberos/SASL auth problems (ZK
>> server closes connection; client sees connection failure and retries)
>>  - the fact that its failure modes aren't always understood by people
>> coding against it.
>>
>> http://blog.cloudera.com/blog/2014/03/zookeeper-resilience-at-pinterest/
>>
>> The Raft algorithm appears to be easier to implement than Paxos; there
>> are things built on it, and I look forward to seeing what works/doesn't
>> work in production.
>>
>> Certainly Aphyr found problems when he pointed Jepsen at etcd, though
>> being a 2014 piece of work, I expect those specific problems to have been
>> addressed. The main thing is: it shows how hard it is to get things right
>> in the presence of complex failures.
>>
>> Finally, regarding S3:
>>
>> You can use the S3 object store as a source of data in queries/streaming
>> and, if done carefully, as a destination. Performance is variable; that is
>> something some of us are working on, across S3A, Spark and Hive.
>>
>> Conference placement: I shall be talking on that topic at Spark Summit
>> Europe if you want to find out more: https://spark-summit.org/eu-2016/
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>> Mesos also uses ZK for leader election. There seems to be some effort in
>> supporting etcd, but it's in progress:
>> https://issues.apache.org/jira/browse/MESOS-1806
>>
>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>
>> @Ofir @Sean, very good points.
>>
>> @Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do
>> many things, but for our use case all we need is high availability, and
>> given the frustrations of the devops people here at our company, who had
>> extensive experience managing large clusters in the past, we would be very
>> happy to avoid ZooKeeper. I also heard that Mesos can provide high
>> availability through etcd and Consul, and if that is true I will be left 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali

@Steve, your arguments make sense; however, a good majority of people who have
extensive experience with ZooKeeper prefer to avoid it, and given the ease of
Consul (which, btw, uses Raft for leader election) and etcd, a lot of us are
more inclined to avoid ZK.
And yes, any technology needs time to mature, but that shouldn't stop us from
transitioning. For example, people started using Spark when it was first
released instead of waiting for Spark 2.0, which has a lot of optimizations
and bug fixes.





On Fri, Aug 26, 2016 2:50 AM, Steve Loughran ste...@hortonworks.com wrote:

On 25 Aug 2016, at 22:49, kant kodali < kanth...@gmail.com > wrote:
Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797






I worry about any attempt to implement distributed consensus systems: they take
time in production to get right.
1. There's the need to prove that what you are building is valid: that the
specification is correct and the implementation matches it. That has apparently
been done for ZK, though given the complexity of the maths involved, I cannot
vouch for it myself:
https://blog.acolyer.org/2015/03/09/zab-high-performance-broadcast-for-primary-backup-systems/

2. You need to run it in production to find the problems. Google's Chubby paper
hints at the things they found went wrong there. As far as ZK goes, Jepsen
hints it's robust:
https://aphyr.com/posts/291-jepsen-zookeeper
If it has weaknesses, I'd point at:
 - its security model
 - its lack of helpfulness when there are Kerberos/SASL auth problems (ZK
server closes connection; client sees connection failure and retries)
 - the fact that its failure modes aren't always understood by people coding
against it.
http://blog.cloudera.com/blog/2014/03/zookeeper-resilience-at-pinterest/
The Raft algorithm appears to be easier to implement than Paxos; there are
things built on it, and I look forward to seeing what works/doesn't work in
production.
Certainly Aphyr found problems when he pointed Jepsen at etcd, though being a
2014 piece of work, I expect those specific problems to have been addressed.
The main thing is: it shows how hard it is to get things right in the presence
of complex failures.
Finally, regarding S3:
You can use the S3 object store as a source of data in queries/streaming and,
if done carefully, as a destination. Performance is variable; that is something
some of us are working on, across S3A, Spark and Hive.
Conference placement: I shall be talking on that topic at Spark Summit Europe
if you want to find out more: https://spark-summit.org/eu-2016/

On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress: 
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean, very good points.
@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the devops people here at our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
ZooKeeper. I also heard that Mesos can provide high availability through etcd
and Consul, and if that is true I will be left with the following stack:

Spark + Mesos scheduler + a distributed file system (or, to be precise,
distributed storage, since S3 is an object store), so I guess this will be HDFS
for us, plus etcd & Consul. Now the big question for me is how do I set all
this up.

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran

On 25 Aug 2016, at 22:49, kant kodali wrote:

Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.

https://issues.apache.org/jira/browse/MESOS-3797




I worry about any attempt to implement distributed consensus systems: they take 
time in production to get right.

1. There's the need to prove that what you are building is valid: that the
specification is correct and the implementation matches it. That has apparently
been done for ZK, though given the complexity of the maths involved, I cannot
vouch for it myself:
https://blog.acolyer.org/2015/03/09/zab-high-performance-broadcast-for-primary-backup-systems/

2. You need to run it in production to find the problems. Google's Chubby paper
hints at the things they found went wrong there. As far as ZK goes, Jepsen
hints it's robust:

https://aphyr.com/posts/291-jepsen-zookeeper

If it has weaknesses, I'd point at:
 - its security model
 - its lack of helpfulness when there are Kerberos/SASL auth problems (ZK
server closes connection; client sees connection failure and retries)
 - the fact that its failure modes aren't always understood by people coding
against it.

http://blog.cloudera.com/blog/2014/03/zookeeper-resilience-at-pinterest/

The Raft algorithm appears to be easier to implement than Paxos; there are
things built on it, and I look forward to seeing what works/doesn't work in
production.

Certainly Aphyr found problems when he pointed Jepsen at etcd, though being a
2014 piece of work, I expect those specific problems to have been addressed.
The main thing is: it shows how hard it is to get things right in the presence
of complex failures.

Finally, regarding S3:

You can use the S3 object store as a source of data in queries/streaming and,
if done carefully, as a destination. Performance is variable; that is something
some of us are working on, across S3A, Spark and Hive.

Conference placement: I shall be talking on that topic at Spark Summit Europe
if you want to find out more: https://spark-summit.org/eu-2016/


On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt 
mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election.  There seems to be some effort in 
supporting etcd, but it's in progress: 
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali wrote:
@Ofir @Sean, very good points.

@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the devops people here at our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
ZooKeeper. I also heard that Mesos can provide high availability through etcd
and Consul, and if that is true I will be left with the following stack:

Spark + Mesos scheduler + a distributed file system (or, to be precise,
distributed storage, since S3 is an object store), so I guess this will be HDFS
for us, plus etcd & Consul. Now the big question for me is how do I set all
this up.








Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

The ZFS Linux port has become very stable these days, given that LLNL maintains
it and also uses it as the file system for their supercomputer (which, from
what I heard, is one of the top machines in the nation).





On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote:
How about using ZFS?





On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:
That's often not as important as you might think. It really only affects the
loading of data by the first Stage. Subsequent Stages (in the same Job or even
in other Jobs if you do it right) will use the map outputs, and will do so with
good data locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha < guha.a...@gmail.com > wrote:
At the core of it, MapReduce relies heavily on data locality. You would lose
the ability to process data closest to where it resides if you do not use HDFS.
S3 or NFS will not be able to provide that.

On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote:
Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797





On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean, very good points.
@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the devops people here at our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
ZooKeeper. I also heard that Mesos can provide high availability through etcd
and Consul, and if that is true I will be left with the following stack:
Spark + Mesos scheduler + a distributed file system (or, to be precise,
distributed storage, since S3 is an object store), so I guess this will be HDFS
for us, plus etcd & Consul. Now the big question for me is how do I set all
this up.





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding the HDFS dependency. Have a look at
checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.

At the heart of it, and with reference to the points you raised about HDFS, one
needs a working knowledge of the Hadoop core system, including HDFS, the
MapReduce algorithm and YARN, whether one uses them or not. After all, Big Data
is all about horizontal scaling with a master and nodes (as opposed to vertical
scaling, like SQL Server running on a host) and distributed data (by default,
data is replicated three times on different nodes for scalability and
availability).

Other members, including Sean, provided the limits on how far one can operate
Spark in its own space. If you are going to deal with data (data in motion and
data at rest), then you will need to interact with some form of storage, and
HDFS and compatible file systems like S3 are the natural choices.

ZooKeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is used with Hive for concurrency, and it is also a distributed
locking system.

HTH
Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/paying a role/playing a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality, so that the non-distributed data that you are sending out
from the driver is actually playing a role more like code (or at least
parameters). What is sent from the driver to an Executor is then used
(typically as seeds or parameters) to execute some procedure on the Worker node
that generates the actual data on the Workers. After that, you proceed to
execute in a more 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

How about using ZFS?





On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:
That's often not as important as you might think. It really only affects the
loading of data by the first Stage. Subsequent Stages (in the same Job or even
in other Jobs if you do it right) will use the map outputs, and will do so with
good data locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha < guha.a...@gmail.com > wrote:
At the core of it, MapReduce relies heavily on data locality. You would lose
the ability to process data closest to where it resides if you do not use HDFS.
S3 or NFS will not be able to provide that.

On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote:
Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797





On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean, very good points.
@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the devops people here at our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
ZooKeeper. I also heard that Mesos can provide high availability through etcd
and Consul, and if that is true I will be left with the following stack:
Spark + Mesos scheduler + a distributed file system (or, to be precise,
distributed storage, since S3 is an object store), so I guess this will be HDFS
for us, plus etcd & Consul. Now the big question for me is how do I set all
this up.





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding the HDFS dependency. Have a look at
checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.

At the heart of it, and with reference to the points you raised about HDFS, one
needs a working knowledge of the Hadoop core system, including HDFS, the
MapReduce algorithm and YARN, whether one uses them or not. After all, Big Data
is all about horizontal scaling with a master and nodes (as opposed to vertical
scaling, like SQL Server running on a host) and distributed data (by default,
data is replicated three times on different nodes for scalability and
availability).

Other members, including Sean, provided the limits on how far one can operate
Spark in its own space. If you are going to deal with data (data in motion and
data at rest), then you will need to interact with some form of storage, and
HDFS and compatible file systems like S3 are the natural choices.

ZooKeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is used with Hive for concurrency, and it is also a distributed
locking system.

HTH
Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/paying a role/playing a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality, so that the non-distributed data that you are sending out
from the driver is actually playing a role more like code (or at least
parameters). What is sent from the driver to an Executor is then used
(typically as seeds or parameters) to execute some procedure on the Worker node
that generates the actual data on the Workers. After that, you proceed to
execute in a more typical fashion with Spark, using the now-instantiated
distributed data.
But I don't get the sense that this meta-programming-ish style is really what
the OP was aiming at.
On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen < so...@cloudera.com > wrote:
Without a distributed storage system, 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
That's often not as important as you might think.  It really only affects
the loading of data by the first Stage.  Subsequent Stages (in the same Job
or even in other Jobs if you do it right) will use the map outputs, and
will do so with good data locality.

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha  wrote:

> At the core of it, MapReduce relies heavily on data locality. You would
> lose the ability to process data closest to where it resides if you do not
> use HDFS.
> S3 or NFS will not be able to provide that.
> On 26 Aug 2016 07:49, "kant kodali"  wrote:
>
>> Yeah, so it seems like it's a work in progress. At the very least, Mesos
>> took the initiative to provide alternatives to ZK. I am really looking
>> forward to this.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>>> Mesos also uses ZK for leader election. There seems to be some effort
>>> in supporting etcd, but it's in progress:
>>> https://issues.apache.org/jira/browse/MESOS-1806
>>>
>>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>>
>>> @Ofir @Sean, very good points.
>>>
>>> @Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do
>>> many things, but for our use case all we need is high availability, and
>>> given the frustrations of the devops people here at our company, who had
>>> extensive experience managing large clusters in the past, we would be very
>>> happy to avoid ZooKeeper. I also heard that Mesos can provide high
>>> availability through etcd and Consul, and if that is true I will be left
>>> with the following stack:
>>>
>>> Spark + Mesos scheduler + a distributed file system (or, to be precise,
>>> distributed storage, since S3 is an object store), so I guess this will be
>>> HDFS for us, plus etcd & Consul. Now the big question for me is how do I
>>> set all this up.
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>>
>>> Just to add one concrete example regarding the HDFS dependency.
>>> Have a look at checkpointing:
>>> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>>> For example, for Spark Streaming, you cannot do any window operation in
>>> a cluster without checkpointing to HDFS (or S3).
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi Kant,
>>>
>>> I trust the following would be of use.
>>>
>>> Big Data depends on the Hadoop ecosystem from whichever angle one looks
>>> at it.
>>>
>>> At the heart of it, and with reference to the points you raised about
>>> HDFS, one needs a working knowledge of the Hadoop core system, including
>>> HDFS, the MapReduce algorithm and YARN, whether one uses them or not.
>>> After all, Big Data is all about horizontal scaling with a master and
>>> nodes (as opposed to vertical scaling, like SQL Server running on a host)
>>> and distributed data (by default, data is replicated three times on
>>> different nodes for scalability and availability).
>>>
>>> Other members, including Sean, provided the limits on how far one can
>>> operate Spark in its own space. If you are going to deal with data (data
>>> in motion and data at rest), then you will need to interact with some form
>>> of storage, and HDFS and compatible file systems like S3 are the natural
>>> choices.
>>>
>>> ZooKeeper is not just about high availability. It is used in Spark
>>> Streaming with Kafka, it is used with Hive for concurrency, and it is also
>>> a distributed locking system.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 August 2016 at 20:52, Mark Hamstra 
>>> wrote:
>>>
>>> s/paying a role/playing a role/
>>>
>>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>>> wrote:
>>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality, so that the non-distributed data that you
>>> are sending out from the driver is actually playing a role more like code
>>> (or at least parameters). What is sent from the driver to an Executor is
>>> then used (typically as seeds or parameters) to execute some procedure on
>>> the
>>> 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
> You would lose the ability to process data closest to where it resides if
> you do not use hdfs.

This isn't true.  Many other data sources (e.g. Cassandra) support locality.
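A hedged sketch with the DataStax spark-cassandra-connector (keyspace/table hypothetical): the connector maps Spark partitions to Cassandra token ranges and reports the replica nodes as preferred locations, so tasks can be scheduled next to the data:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._  // DataStax spark-cassandra-connector

    val sc = new SparkContext(new SparkConf()
      .setAppName("cassandra-locality-sketch")
      .set("spark.cassandra.connection.host", "10.0.0.1"))  // hypothetical node

    // Each partition of this RDD corresponds to a Cassandra token range, and
    // the connector exposes the replica hosts as preferred locations, so Spark
    // can schedule tasks on (or near) the nodes that already hold the data.
    val rows = sc.cassandraTable("my_keyspace", "events")  // hypothetical table
    println(rows.count())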

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha  wrote:

> At the core of it, MapReduce relies heavily on data locality. You would
> lose the ability to process data closest to where it resides if you do not
> use HDFS.
> S3 or NFS will not be able to provide that.
> On 26 Aug 2016 07:49, "kant kodali"  wrote:
>
>> Yeah, so it seems like it's a work in progress. At the very least, Mesos
>> took the initiative to provide alternatives to ZK. I am really looking
>> forward to this.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>>> Mesos also uses ZK for leader election. There seems to be some effort
>>> in supporting etcd, but it's in progress:
>>> https://issues.apache.org/jira/browse/MESOS-1806
>>>
>>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>>
>>> @Ofir @Sean, very good points.
>>>
>>> @Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do
>>> many things, but for our use case all we need is high availability, and
>>> given the frustrations of the devops people here at our company, who had
>>> extensive experience managing large clusters in the past, we would be very
>>> happy to avoid ZooKeeper. I also heard that Mesos can provide high
>>> availability through etcd and Consul, and if that is true I will be left
>>> with the following stack:
>>>
>>> Spark + Mesos scheduler + a distributed file system (or, to be precise,
>>> distributed storage, since S3 is an object store), so I guess this will be
>>> HDFS for us, plus etcd & Consul. Now the big question for me is how do I
>>> set all this up.
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>>
>>> Just to add one concrete example regarding the HDFS dependency.
>>> Have a look at checkpointing:
>>> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>>> For example, for Spark Streaming, you cannot do any window operation in
>>> a cluster without checkpointing to HDFS (or S3).
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi Kant,
>>>
>>> I trust the following would be of use.
>>>
>>> Big Data depends on the Hadoop ecosystem from whichever angle one looks
>>> at it.
>>>
>>> At the heart of it, and with reference to the points you raised about
>>> HDFS, one needs a working knowledge of the Hadoop core system, including
>>> HDFS, the MapReduce algorithm and YARN, whether one uses them or not.
>>> After all, Big Data is all about horizontal scaling with a master and
>>> nodes (as opposed to vertical scaling, like SQL Server running on a host)
>>> and distributed data (by default, data is replicated three times on
>>> different nodes for scalability and availability).
>>>
>>> Other members, including Sean, provided the limits on how far one can
>>> operate Spark in its own space. If you are going to deal with data (data
>>> in motion and data at rest), then you will need to interact with some form
>>> of storage, and HDFS and compatible file systems like S3 are the natural
>>> choices.
>>>
>>> ZooKeeper is not just about high availability. It is used in Spark
>>> Streaming with Kafka, it is used with Hive for concurrency, and it is also
>>> a distributed locking system.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 August 2016 at 20:52, Mark Hamstra 
>>> wrote:
>>>
>>> s/paying a role/playing a role/
>>>
>>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>>> wrote:
>>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality, so that the non-distributed data that you
>>> are sending out from the driver is actually playing a role more like code
>>> (or at least parameters). What is sent from the driver to an Executor is
>>> then used (typically as seeds or parameters) to execute some procedure on
>>> the Worker node that generates the actual data on the Workers. After that,
>>> you proceed 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread ayan guha
At the core of it, MapReduce relies heavily on data locality. You would
lose the ability to process data closest to where it resides if you do not
use HDFS.
S3 or NFS will not be able to provide that.
On 26 Aug 2016 07:49, "kant kodali"  wrote:

> Yeah, so it seems like it's a work in progress. At the very least, Mesos
> took the initiative to provide alternatives to ZK. I am really looking
> forward to this.
>
> https://issues.apache.org/jira/browse/MESOS-3797
>
>
>
> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
> wrote:
>
>> Mesos also uses ZK for leader election. There seems to be some effort in
>> supporting etcd, but it's in progress:
>> https://issues.apache.org/jira/browse/MESOS-1806
>>
>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>
>> @Ofir @Sean, very good points.
>>
>> @Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do
>> many things, but for our use case all we need is high availability, and
>> given the frustrations of the devops people here at our company, who had
>> extensive experience managing large clusters in the past, we would be very
>> happy to avoid ZooKeeper. I also heard that Mesos can provide high
>> availability through etcd and Consul, and if that is true I will be left
>> with the following stack:
>>
>> Spark + Mesos scheduler + a distributed file system (or, to be precise,
>> distributed storage, since S3 is an object store), so I guess this will be
>> HDFS for us, plus etcd & Consul. Now the big question for me is how do I
>> set all this up.
>>
>>
>>
>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>
>> Just to add one concrete example regarding the HDFS dependency.
>> Have a look at checkpointing:
>> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>> For example, for Spark Streaming, you cannot do any window operation in
>> a cluster without checkpointing to HDFS (or S3).
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi Kant,
>>
>> I trust the following would be of use.
>>
>> Big Data depends on the Hadoop ecosystem from whichever angle one looks
>> at it.
>>
>> At the heart of it, and with reference to the points you raised about
>> HDFS, one needs a working knowledge of the Hadoop core system, including
>> HDFS, the MapReduce algorithm and YARN, whether one uses them or not.
>> After all, Big Data is all about horizontal scaling with a master and
>> nodes (as opposed to vertical scaling, like SQL Server running on a host)
>> and distributed data (by default, data is replicated three times on
>> different nodes for scalability and availability).
>>
>> Other members, including Sean, provided the limits on how far one can
>> operate Spark in its own space. If you are going to deal with data (data
>> in motion and data at rest), then you will need to interact with some form
>> of storage, and HDFS and compatible file systems like S3 are the natural
>> choices.
>>
>> ZooKeeper is not just about high availability. It is used in Spark
>> Streaming with Kafka, it is used with Hive for concurrency, and it is also
>> a distributed locking system.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 20:52, Mark Hamstra  wrote:
>>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>> wrote:
>>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality, so that the non-distributed data that you
>> are sending out from the driver is actually playing a role more like code
>> (or at least parameters). What is sent from the driver to an Executor is
>> then used (typically as seeds or parameters) to execute some procedure on
>> the Worker node that generates the actual data on the Workers. After that,
>> you proceed to execute in a more typical fashion with Spark, using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>
>> Without a distributed storage system, your application can only create
>> data on the 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797





On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress: 
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean, very good points.
@Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the devops people here at our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
ZooKeeper. I also heard that Mesos can provide high availability through etcd
and Consul, and if that is true I will be left with the following stack:
Spark + Mesos scheduler + a distributed file system (or, to be precise,
distributed storage, since S3 is an object store), so I guess this will be HDFS
for us, plus etcd & Consul. Now the big question for me is how do I set all
this up.





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding the HDFS dependency. Have a look at
checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.

At the heart of it, and with reference to the points you raised about HDFS, one
needs a working knowledge of the Hadoop core system, including HDFS, the
MapReduce algorithm and YARN, whether one uses them or not. After all, Big Data
is all about horizontal scaling with a master and nodes (as opposed to vertical
scaling, like SQL Server running on a host) and distributed data (by default,
data is replicated three times on different nodes for scalability and
availability).

Other members, including Sean, provided the limits on how far one can operate
Spark in its own space. If you are going to deal with data (data in motion and
data at rest), then you will need to interact with some form of storage, and
HDFS and compatible file systems like S3 are the natural choices.

ZooKeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is used with Hive for concurrency, and it is also a distributed
locking system.

HTH
Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/paying a role/playing a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality, so that the non-distributed data that you are sending out
from the driver is actually playing a role more like code (or at least
parameters). What is sent from the driver to an Executor is then used
(typically as seeds or parameters) to execute some procedure on the Worker node
that generates the actual data on the Workers. After that, you proceed to
execute in a more typical fashion with Spark, using the now-instantiated
distributed data.
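A minimal sketch of that pattern (the generator logic is hypothetical): only small seeds travel from the driver; the Workers expand them into the actual distributed dataset:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    val sc = new SparkContext(new SparkConf().setAppName("seed-sketch"))

    // Only these small seeds travel from the driver to the executors; they
    // act more like code/parameters than data.
    val seeds = sc.parallelize(1 to 1000)

    // The actual data is generated on the Workers, one chunk per seed, and
    // never exists on the driver at all.
    val samples = seeds.flatMap { seed =>
      val rng = new Random(seed)
      Seq.fill(10000)(rng.nextGaussian())
    }

    println(samples.mean())  // proceed with normal distributed processing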
But I don't get the sense that this meta-programming-ish style is really what
the OP was aiming at.
On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen < so...@cloudera.com > wrote:
Without a distributed storage system, your application can only create data on
the driver and send it out to the workers, and collect data back from the
workers. You can't read or write data in a distributed way. There are use cases
for this, but pretty limited (unless you're running on one machine).
I can't really imagine a serious use of (distributed) Spark without
(distributed) storage, in the same way that I don't think many apps exist that
don't read/write data.
The premise here is not just replication, but partitioning data across compute
resources. With a distributed file system, your big input exists across a bunch
of machines and you can send the work to the pieces of data.
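A hedged sketch of that limited-but-real mode: with no distributed storage at all, data can only be parallelized out from the driver and collected back:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("driver-only-sketch"))

    // Without HDFS/S3/..., the input has to originate on the driver...
    val numbers = sc.parallelize(1 to 1000000)

    // ...and the results have to come back to the driver; there is nowhere
    // distributed to write them.
    val squares = numbers.map(n => n.toLong * n).collect()
    println(squares.take(5).mkString(", "))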
On Thu, Aug 25, 2016 at 7:57 PM, kant kodali < 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
Mesos also uses ZK for leader election.  There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:

> @Ofir @Sean, very good points.
>
> @Mike, we don't use Kafka or Hive, and I understand that ZooKeeper can do
> many things, but for our use case all we need is high availability, and
> given the frustrations of the devops people here at our company, who had
> extensive experience managing large clusters in the past, we would be very
> happy to avoid ZooKeeper. I also heard that Mesos can provide high
> availability through etcd and Consul, and if that is true I will be left
> with the following stack:
>
> Spark + Mesos scheduler + a distributed file system (or, to be precise,
> distributed storage, since S3 is an object store), so I guess this will be
> HDFS for us, plus etcd & Consul. Now the big question for me is how do I
> set all this up.
>
>
>
> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>
>> Just to add one concrete example regarding HDFS dependency.
>> Have a look at checkpointing:
>> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>> For example, for Spark Streaming, you cannot do any window operation in
>> a cluster without checkpointing to HDFS (or S3).
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi Kant,
>>
>> I trust the following would be of use.
>>
>> Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
>>
>> In the heart of it and with reference to points you raised about HDFS,
>> one needs to have a working knowledge of Hadoop Core System including HDFS,
>> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
>> Data is all about horizontal scaling with master and nodes (as opposed to
>> vertical scaling like SQL Server running on a host), and distributed data
>> (by default data is replicated three times on different nodes for
>> scalability and availability).
>>
>> Other members, including Sean, explained the limits on how far one can operate
>> Spark in its own space. If you are going to deal with data (data in motion
>> and data at rest), then you will need to interact with some form of storage
>> and HDFS and compatible file systems like S3 are the natural choices.
>>
>> Zookeeper is not just about high availability. It is used in Spark
>> Streaming with Kafka; it is also used with Hive for concurrency. It is also
>> a distributed locking system.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 20:52, Mark Hamstra  wrote:
>>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>> wrote:
>>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data that you are
>> sending out from the driver is actually paying a role more like code (or at
>> least parameters.)  What is sent from the driver to an Executor is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers.  After that, you
>> proceed to execute in a more typical fashion with Spark using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way. There
>> are use cases for this, but pretty limited (unless you're running on 1
>> machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distributed) storage, in the same way that I don't think many apps exist that don't
>> read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to the pieces of data.

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

@Ofir @Sean very good points.
@Mike We don't use Kafka or Hive, and I understand that Zookeeper can do many
things, but for our use case all we need is high availability, and given the
frustrations of the DevOps people here in our company, who have extensive
experience managing large clusters, we would be very happy to avoid Zookeeper. I
also heard that Mesos can provide high availability through etcd and consul, and
if that is true I will be left with the following stack:
Spark + Mesos scheduler + distributed file system (or, to be precise,
distributed storage, since S3 is an object store, so I guess this will be HDFS
for us) + etcd & consul. Now the big question for me is how I set all this up.





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
In the heart of it and with reference to points you raised about HDFS, one needs
to have a working knowledge of Hadoop Core System including HDFS, Map-reduce
algorithm and Yarn whether one uses them or not. After all Big Data is all about
horizontal scaling with master and nodes (as opposed to vertical scaling like
SQL Server running on a host), and distributed data (by default data is
replicated three times on different nodes for scalability and availability).
Other members, including Sean, explained the limits on how far one can operate Spark in
its own space. If you are going to deal with data (data in motion and data at
rest), then you will need to interact with some form of storage and HDFS and
compatible file systems like S3 are the natural choices.
Zookeeper is not just about high availability. It is used in Spark Streaming
with Kafka; it is also used with Hive for concurrency. It is also a distributed
locking system.
HTH
Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/paying a role/playing a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality so that the non-distributed data that you are sending out from
the driver is actually paying a role more like code (or at least parameters.)
What is sent from the driver to an Executor is then used (typically as seeds or
parameters) to execute some procedure on the Worker node that generates the
actual data on the Workers. After that, you proceed to execute in a more typical
fashion with Spark using the now-instantiated distributed data.
But I don't get the sense that this meta-programming-ish style is really what
the OP was aiming at.
On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen < so...@cloudera.com > wrote:
Without a distributed storage system, your application can only create data on
the driver and send it out to the workers, and collect data back from the
workers. You can't read or write data in a distributed way. There are use cases
for this, but pretty limited (unless you're running on 1 machine).
I can't really imagine a serious use of (distributed) Spark without (distributed)
storage, in the same way that I don't think many apps exist that don't read/write data.
The premise here is not just replication, but partitioning data across compute
resources. With a distributed file system, your big input exists across a bunch
of machines and you can send the work to the pieces of data.
On Thu, Aug 25, 2016 at 7:57 PM, kant kodali < kanth...@gmail.com > wrote:
@Mich I understand why I would need Zookeeper. It is there for fault tolerance,
given that Spark is a master-slave architecture, and when a master goes down
Zookeeper will run a leader election algorithm to elect a new leader. However,
DevOps hate Zookeeper; they would be much happier to go with etcd & consul, and
it looks like if we use the Mesos scheduler we should be able to drop Zookeeper.
As for HDFS, I am still trying to understand why I would need it for Spark. I
understand the purpose of distributed file systems in general but I 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Ofir Manor
Just to add one concrete example regarding HDFS dependency.
Have a look at checkpointing
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
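Roughly what that looks like in code, as a minimal sketch against the Spark 1.6
streaming API (the HDFS path and the socket source are made-up placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs://namenode:8020/spark/checkpoints"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("windowed-counts")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)  // must be fault-tolerant storage (HDFS/S3)
      val lines = ssc.socketTextStream("stream-host", 9999)
      // Window operations like this fail fast unless checkpointing is enabled.
      lines.countByWindow(Seconds(60), Seconds(10)).print()
      ssc
    }

    // Recover from the checkpoint on restart, or build a fresh context first time.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()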

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh  wrote:

> Hi Kant,
>
> I trust the following would be of use.
>
> Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
>
> In the heart of it and with reference to points you raised about HDFS, one
> needs to have a working knowledge of Hadoop Core System including HDFS,
> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
> Data is all about horizontal scaling with master and nodes (as opposed to
> vertical scaling like SQL Server running on a host), and distributed data
> (by default data is replicated three times on different nodes for
> scalability and availability).
>
> Other members, including Sean, explained the limits on how far one can operate
> Spark in its own space. If you are going to deal with data (data in motion
> and data at rest), then you will need to interact with some form of storage
> and HDFS and compatible file systems like S3 are the natural choices.
>
> Zookeeper is not just about high availability. It is used in Spark
> Streaming with Kafka; it is also used with Hive for concurrency. It is also
> a distributed locking system.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 20:52, Mark Hamstra  wrote:
>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>> wrote:
>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality so that the non-distributed data that you are
>>> sending out from the driver is actually paying a role more like code (or at
>>> least parameters.)  What is sent from the driver to an Executor is then
>>> used (typically as seeds or parameters) to execute some procedure on the
>>> Worker node that generates the actual data on the Workers.  After that, you
>>> proceed to execute in a more typical fashion with Spark using the
>>> now-instantiated distributed data.
>>>
>>> But I don't get the sense that this meta-programming-ish style is really
>>> what the OP was aiming at.
>>>
>>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>>
 Without a distributed storage system, your application can only create
 data on the driver and send it out to the workers, and collect data back
 from the workers. You can't read or write data in a distributed way. There
 are use cases for this, but pretty limited (unless you're running on 1
 machine).

 I can't really imagine a serious use of (distributed) Spark without
 (distributed) storage, in the same way that I don't think many apps exist that don't
 read/write data.

 The premise here is not just replication, but partitioning data across
 compute resources. With a distributed file system, your big input exists
 across a bunch of machines and you can send the work to the pieces of data.

 On Thu, Aug 25, 2016 at 7:57 PM, kant kodali 
 wrote:

> @Mich I understand why I would need Zookeeper. It is there for fault
> tolerance, given that Spark is a master-slave architecture, and when a master
> goes down Zookeeper will run a leader election algorithm to elect a new
> leader. However, DevOps hate Zookeeper; they would be much happier to go with
> etcd & consul, and it looks like if we use the Mesos scheduler we should be
> able to drop Zookeeper.
>
> As for HDFS, I am still trying to understand why I would need it for Spark. I
> understand the purpose of distributed file systems in general, but I don't
> understand it in the context of Spark, since many people say you can run a
> Spark distributed cluster in standalone mode, but I am not sure what the
> pros/cons are if we do it that way. In a Hadoop world I understand that one
> of the reasons HDFS is there is for replication; in other words, if we write
> some data to HDFS it will store that block across different nodes such
> that if one 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
Hi Kant,

I trust the following would be of use.

Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.

In the heart of it and with reference to points you raised about HDFS, one
needs to have a working knowledge of Hadoop Core System including HDFS,
Map-reduce algorithm and Yarn whether one uses them or not. After all Big
Data is all about horizontal scaling with master and nodes (as opposed to
vertical scaling like SQL Server running on a host), and distributed data
(by default data is replicated three times on different nodes for
scalability and availability).

Other members, including Sean, explained the limits on how far one can operate
Spark in its own space. If you are going to deal with data (data in motion
and data at rest), then you will need to interact with some form of storage
and HDFS and compatible file systems like S3 are the natural choices.

Zookeeper is not just about high availability. It is used in Spark
Streaming with Kafka; it is also used with Hive for concurrency. It is also
a distributed locking system.

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 20:52, Mark Hamstra  wrote:

> s/paying a role/playing a role/
>
> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
> wrote:
>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data that you are
>> sending out from the driver is actually paying a role more like code (or at
>> least parameters.)  What is sent from the driver to an Executor is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers.  After that, you
>> proceed to execute in a more typical fashion with Spark using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>
>>> Without a distributed storage system, your application can only create
>>> data on the driver and send it out to the workers, and collect data back
>>> from the workers. You can't read or write data in a distributed way. There
>>> are use cases for this, but pretty limited (unless you're running on 1
>>> machine).
>>>
>>> I can't really imagine a serious use of (distributed) Spark without
>>> (distributed) storage, in the same way that I don't think many apps exist that don't
>>> read/write data.
>>>
>>> The premise here is not just replication, but partitioning data across
>>> compute resources. With a distributed file system, your big input exists
>>> across a bunch of machines and you can send the work to the pieces of data.
>>>
>>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>>>
 @Mich I understand why I would need Zookeeper. It is there for fault
 tolerance, given that Spark is a master-slave architecture, and when a master
 goes down Zookeeper will run a leader election algorithm to elect a new
 leader. However, DevOps hate Zookeeper; they would be much happier to go with
 etcd & consul, and it looks like if we use the Mesos scheduler we should be
 able to drop Zookeeper.

 As for HDFS, I am still trying to understand why I would need it for Spark. I
 understand the purpose of distributed file systems in general, but I don't
 understand it in the context of Spark, since many people say you can run a
 Spark distributed cluster in standalone mode, but I am not sure what the
 pros/cons are if we do it that way. In a Hadoop world I understand that one
 of the reasons HDFS is there is for replication; in other words, if we write
 some data to HDFS it will store that block across different nodes such that
 if one of the nodes goes down it can still retrieve that block from other
 nodes. In the context of Spark I am not really sure, because 1) I am new and
 2) the Spark paper says it doesn't replicate data; instead it stores the
 lineage (all the transformations) such that it can reconstruct it.






 On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
 wrote:

> You can use Spark on Oracle as a query tool.
>
> It all depends on the mode of the operation.
>
> If you are running Spark with yarn-client/cluster then you will need YARN.
> It comes as 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
s/paying a role/playing a role/

On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
wrote:

> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you are
> sending out from the driver is actually paying a role more like code (or at
> least parameters.)  What is sent from the driver to an Executor is then
> used (typically as seeds or parameters) to execute some procedure on the
> Worker node that generates the actual data on the Workers.  After that, you
> proceed to execute in a more typical fashion with Spark using the
> now-instantiated distributed data.
>
> But I don't get the sense that this meta-programming-ish style is really
> what the OP was aiming at.
>
> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way. There
>> are use cases for this, but pretty limited (unless you're running on 1
>> machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distributed) storage, in the same way that I don't think many apps exist that don't
>> read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to the pieces of data.
>>
>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>>
>>> @Mich I understand why I would need Zookeeper. It is there for fault
>>> tolerance, given that Spark is a master-slave architecture, and when a master
>>> goes down Zookeeper will run a leader election algorithm to elect a new
>>> leader. However, DevOps hate Zookeeper; they would be much happier to go with
>>> etcd & consul, and it looks like if we use the Mesos scheduler we should be
>>> able to drop Zookeeper.
>>>
>>> As for HDFS, I am still trying to understand why I would need it for Spark. I
>>> understand the purpose of distributed file systems in general, but I don't
>>> understand it in the context of Spark, since many people say you can run a
>>> Spark distributed cluster in standalone mode, but I am not sure what the
>>> pros/cons are if we do it that way. In a Hadoop world I understand that one
>>> of the reasons HDFS is there is for replication; in other words, if we write
>>> some data to HDFS it will store that block across different nodes such that
>>> if one of the nodes goes down it can still retrieve that block from other
>>> nodes. In the context of Spark I am not really sure, because 1) I am new and
>>> 2) the Spark paper says it doesn't replicate data; instead it stores the
>>> lineage (all the transformations) such that it can reconstruct it.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>>> wrote:
>>>
 You can use Spark on Oracle as a query tool.

 It all depends on the mode of the operation.

 If you are running Spark with yarn-client/cluster then you will need YARN.
 It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).

 I have not gone and installed Yarn without installing Hadoop.

 What is the overriding reason to have Spark on its own?

  You can use Spark in Local or Standalone mode if you do not want
 Hadoop core.

 HTH

 Dr Mich Talebzadeh



 LinkedIn
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



 http://talebzadehmich.wordpress.com


 Disclaimer: Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 24 August 2016 at 21:54, kant kodali  wrote:

 What do I lose if I run Spark without using HDFS or Zookeeper? Which
 of them is almost a must in practice?



>>
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually paying a role more like code (or at least
parameters.)  What is sent from the driver to an Executor is then used
(typically as seeds or parameters) to execute some procedure on the Worker
node that generates the actual data on the Workers.  After that, you
proceed to execute in a more typical fashion with Spark using the
now-instantiated distributed data.

But I don't get the sense that this meta-programming-ish style is really
what the OP was aiming at.
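Concretely, the pattern might look something like this illustrative sketch
(assuming an existing SparkContext sc, as in spark-shell):

    // Ship only seeds (code-like parameters) from the driver...
    val seeds = sc.parallelize(1 to 1000, 100)

    // ...and generate the actual distributed data on the workers.
    val data = seeds.flatMap { seed =>
      val rng = new scala.util.Random(seed)
      Seq.fill(10000)(rng.nextGaussian())
    }

    // From here on, work with the now-instantiated distributed data as usual.
    println(data.count())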

On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:

> Without a distributed storage system, your application can only create
> data on the driver and send it out to the workers, and collect data back
> from the workers. You can't read or write data in a distributed way. There
> are use cases for this, but pretty limited (unless you're running on 1
> machine).
>
> I can't really imagine a serious use of (distributed) Spark without
> (distributed) storage, in the same way that I don't think many apps exist that don't
> read/write data.
>
> The premise here is not just replication, but partitioning data across
> compute resources. With a distributed file system, your big input exists
> across a bunch of machines and you can send the work to the pieces of data.
>
> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>
>> @Mich I understand why I would need Zookeeper. It is there for fault
>> tolerance, given that Spark is a master-slave architecture, and when a master
>> goes down Zookeeper will run a leader election algorithm to elect a new
>> leader. However, DevOps hate Zookeeper; they would be much happier to go with
>> etcd & consul, and it looks like if we use the Mesos scheduler we should be
>> able to drop Zookeeper.
>>
>> As for HDFS, I am still trying to understand why I would need it for Spark. I
>> understand the purpose of distributed file systems in general, but I don't
>> understand it in the context of Spark, since many people say you can run a
>> Spark distributed cluster in standalone mode, but I am not sure what the
>> pros/cons are if we do it that way. In a Hadoop world I understand that one
>> of the reasons HDFS is there is for replication; in other words, if we write
>> some data to HDFS it will store that block across different nodes such that
>> if one of the nodes goes down it can still retrieve that block from other
>> nodes. In the context of Spark I am not really sure, because 1) I am new and
>> 2) the Spark paper says it doesn't replicate data; instead it stores the
>> lineage (all the transformations) such that it can reconstruct it.
>>
>>
>>
>>
>>
>>
>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>> wrote:
>>
>>> You can use Spark on Oracle as a query tool.
>>>
>>> It all depends on the mode of the operation.
>>>
>>> If you are running Spark with yarn-client/cluster then you will need YARN.
>>> It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>>
>>> I have not gone and installed Yarn without installing Hadoop.
>>>
>>> What is the overriding reason to have Spark on its own?
>>>
>>>  You can use Spark in Local or Standalone mode if you do not want Hadoop
>>> core.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 August 2016 at 21:54, kant kodali  wrote:
>>>
>>> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
>>> them is almost a must in practice?
>>>
>>>
>>>
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
Without a distributed storage system, your application can only create data
on the driver and send it out to the workers, and collect data back from
the workers. You can't read or write data in a distributed way. There are
use cases for this, but pretty limited (unless you're running on 1 machine).

I can't really imagine a serious use of (distributed) Spark without
(distributed) storage, in the same way that I don't think many apps exist that don't
read/write data.

The premise here is not just replication, but partitioning data across
compute resources. With a distributed file system, your big input exists
across a bunch of machines and you can send the work to the pieces of data.
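To make that concrete: without distributed storage, everything funnels through
the driver, roughly like this illustrative sketch (assuming an existing
SparkContext sc, as in spark-shell):

    // Data is born on the driver and shipped out...
    val rdd = sc.parallelize(1 to 1000000)
    // ...computed on in parallel...
    val squares = rdd.map(x => x.toLong * x)
    // ...and funneled back to the driver. No distributed reads or writes anywhere.
    val result = squares.collect()

With a distributed file system you would instead start from something like
sc.textFile("hdfs://...") and the work would go to where the data already lives.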

On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:

> @Mich I understand why I would need Zookeeper. It is there for fault
> tolerance, given that Spark is a master-slave architecture, and when a master
> goes down Zookeeper will run a leader election algorithm to elect a new
> leader. However, DevOps hate Zookeeper; they would be much happier to go with
> etcd & consul, and it looks like if we use the Mesos scheduler we should be
> able to drop Zookeeper.
>
> As for HDFS, I am still trying to understand why I would need it for Spark. I
> understand the purpose of distributed file systems in general, but I don't
> understand it in the context of Spark, since many people say you can run a
> Spark distributed cluster in standalone mode, but I am not sure what the
> pros/cons are if we do it that way. In a Hadoop world I understand that one
> of the reasons HDFS is there is for replication; in other words, if we write
> some data to HDFS it will store that block across different nodes such that
> if one of the nodes goes down it can still retrieve that block from other
> nodes. In the context of Spark I am not really sure, because 1) I am new and
> 2) the Spark paper says it doesn't replicate data; instead it stores the
> lineage (all the transformations) such that it can reconstruct it.
>
>
>
>
>
>
> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
> wrote:
>
>> You can use Spark on Oracle as a query tool.
>>
>> It all depends on the mode of the operation.
>>
>> If you are running Spark with yarn-client/cluster then you will need YARN. It
>> comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>
>> I have not gone and installed Yarn without installing Hadoop.
>>
>> What is the overriding reason to have Spark on its own?
>>
>>  You can use Spark in Local or Standalone mode if you do not want Hadoop
>> core.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 August 2016 at 21:54, kant kodali  wrote:
>>
>> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
>> them is almost a must in practice?
>>
>>
>>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

@Mich I understand why I would need Zookeeper. It is there for fault tolerance,
given that Spark is a master-slave architecture, and when a master goes down
Zookeeper will run a leader election algorithm to elect a new leader. However,
DevOps hate Zookeeper; they would be much happier to go with etcd & consul, and
it looks like if we use the Mesos scheduler we should be able to drop Zookeeper.
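From what I have read, this is the part we would be dropping: the standalone
masters get that election from a couple of recovery properties, something like
this (illustrative values and hypothetical hosts only):

    # spark-env.sh on each standalone master
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"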
As for HDFS, I am still trying to understand why I would need it for Spark. I
understand the purpose of distributed file systems in general, but I don't
understand it in the context of Spark, since many people say you can run a Spark
distributed cluster in standalone mode, but I am not sure what the pros/cons are
if we do it that way. In a Hadoop world I understand that one of the reasons
HDFS is there is for replication; in other words, if we write some data to HDFS
it will store that block across different nodes such that if one of the nodes
goes down it can still retrieve that block from other nodes. In the context of
Spark I am not really sure, because 1) I am new and 2) the Spark paper says it
doesn't replicate data; instead it stores the lineage (all the transformations)
such that it can reconstruct it.
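A quick way to actually see that lineage (an illustrative sketch with a made-up
input path, assuming an existing SparkContext sc):

    val words = sc.textFile("hdfs://namenode:8020/input/logs")  // hypothetical path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the chain of transformations Spark would replay to rebuild a lost
    // partition, instead of relying on extra replicated copies of the data.
    println(words.toDebugString)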







On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com wrote:
You can use Spark on Oracle as a query tool.
It all depends on the mode of the operation.
If you are running Spark with yarn-client/cluster then you will need YARN. It comes
as part of Hadoop core (HDFS, Map-reduce and Yarn).
I have not gone and installed Yarn without installing Hadoop.
What is the overriding reason to have Spark on its own?
You can use Spark in Local or Standalone mode if you do not want Hadoop core.
HTH
Dr Mich Talebzadeh



LinkedIn 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw




http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 24 August 2016 at 21:54, kant kodali < kanth...@gmail.com > wrote:
What do I lose if I run Spark without using HDFS or Zookeeper? Which of them is
almost a must in practice?

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework.  There are many ways to give it
data to chomp down on.  If you don't know why you would need HDFS, then you
don't need it.  Same goes for Zookeeper.  Spark works fine without either.

Much of what we read online comes from people with specialized problems and
requirements (such as maintaining a 'highly available' service, or
accessing an existing HDFS).  It can be extremely confusing to the dude who
just needs to do some parallel computing.

Pete

On Wed, Aug 24, 2016 at 3:54 PM, kant kodali  wrote:

> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
> them is almost a must in practice?
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
You can use Spark on Oracle as a query tool.

It all depends on the mode of the operation.

If you are running Spark with yarn-client/cluster then you will need YARN. It
comes as part of Hadoop core (HDFS, Map-reduce and Yarn).

I have not gone and installed Yarn without installing Hadoop.

What is the overriding reason to have Spark on its own?

 You can use Spark in Local or Standalone mode if you do not want Hadoop
core.
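For example, the choice of mode is just the master URL; a sketch (the
standalone host is a made-up placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode: everything in one JVM, no Hadoop and no cluster manager at all.
    val localConf = new SparkConf().setAppName("demo").setMaster("local[4]")

    // Standalone mode: Spark's own cluster manager, still no YARN required.
    val standaloneConf = new SparkConf().setAppName("demo")
      .setMaster("spark://master-host:7077")  // hypothetical host

    val sc = new SparkContext(localConf)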

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 August 2016 at 21:54, kant kodali  wrote:

> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
> them is almost a must in practice?
>


What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-24 Thread kant kodali

What do I lose if I run Spark without using HDFS or Zookeeper? Which of them is
almost a must in practice?