Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
My main concern is that the choice of Isilon is not for one use case. It
will be a strategic decision for the client, and if we decide to go that way
we are effectively moving away from HDFS principles (3x replication and so
on) as well.

Granted, one can argue this may be OK, but of course we have to look at our
future needs. From my experience of these tools, you cannot simply roll them
back without incurring considerable work and cost.

And, after all, will the cost justify the whole of this setup? What about
performance and other bottlenecks?

Thanks



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
Mich,

Yes, Isilon is in production...

Isilon is a serious product and has been around for quite a while.  For 
on-premise external storage, we see it quite a bit.  Separating the compute 
from the storage actually helps.  It is also a nice transition to the cloud 
providers.  

Have you looked at MapR?  Usually the system guys target snapshots, volumes,
and POSIX compliance if they are bought into Isilon.

Good luck Mich.

Regards,
John Leach





Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
Hi John,

Thanks. Did you end up in production? In other words, besides the PoC, did
you use it in anger?

The intention is to build Isilon on top of the whole HDFS cluster! If we go
that way, we also need to adopt it for DR as well.

Cheers



Dr Mich Talebzadeh







Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
Mich,

We used Isilon for a POC of Splice Machine (Spark for analytics, HBase for
real-time).  We were concerned initially and the initial setup took a bit
longer than expected, but it performed well on both low-latency and
high-throughput use cases at scale (our POC was ~100 TB).

Just a data point.

Regards,
John Leach


Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
I am concerned about the use case of tools like Isilon or Panasas to create
a layer on top of HDFS, essentially an HCFS on top of HDFS with the usual 3x
replication handled inside the tool itself.

There is interest in pushing Isilon forward as the solution, but my caution
is about the scalability and future-proofing of such tools. So I was
wondering if anyone else has tried such a solution.
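
For illustration, a minimal Spark (Scala) sketch of what this looks like from
the application side, assuming the OneFS access zone exposes its
HDFS-compatible interface on the usual port 8020; the hostname and path below
are invented. The point is that the application code does not change, only
the filesystem endpoint, while block protection moves from HDFS 3x
replication into the storage tool itself:

// Hypothetical sketch: reading the same data through an Isilon/OneFS
// HDFS-compatible endpoint instead of a native HDFS NameNode.
// "isilon-smartconnect" and the path are invented names.
import org.apache.spark.sql.SparkSession

object HcfsReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hcfs-read-sketch")
      // spark.hadoop.* settings are copied into the Hadoop Configuration,
      // so this points the default filesystem at the Isilon access zone.
      .config("spark.hadoop.fs.defaultFS", "hdfs://isilon-smartconnect:8020")
      .getOrCreate()

    // Unchanged application code; data protection is handled inside OneFS
    // rather than by HDFS 3x block replication.
    val df = spark.read.parquet("/data/landing/trades")
    println(s"row count: ${df.count()}")

    spark.stop()
  }
}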

Thanks



Dr Mich Talebzadeh







Re: An Architecture question on the use of virtualised clusters

2017-06-02 Thread Gene Pang
As Vincent mentioned earlier, I think Alluxio can work for this. You can
mount your (potentially remote) storage systems into Alluxio and deploy
Alluxio co-located with the compute cluster. The computation framework will
still achieve data locality since the Alluxio workers are co-located, even
though the existing storage systems may be remote. You can also use tiered
storage to deploy using only memory and/or other physical media.
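
To make that concrete, a minimal sketch of a Spark (Scala) job reading
through a co-located Alluxio tier. It assumes the Alluxio client jar is on
Spark's classpath; the mount target, hostnames and paths are invented, and
the property names are from memory, so they may differ between Alluxio
versions:

// Sketch only. Assumes the remote store has already been mounted into the
// Alluxio namespace, e.g. from the Alluxio master (shell):
//   bin/alluxio fs mount /mnt/remote hdfs://remote-namenode:8020/warehouse
// and that tiered storage is configured in alluxio-site.properties, e.g.:
//   alluxio.worker.tieredstore.levels=2
//   alluxio.worker.tieredstore.level0.alias=MEM
//   alluxio.worker.tieredstore.level1.alias=HDD
import org.apache.spark.sql.SparkSession

object AlluxioReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alluxio-read-sketch").getOrCreate()

    // Reads go through the local Alluxio workers, which cache blocks next to
    // the compute nodes even though the underlying storage is remote; 19998 is
    // the default Alluxio master RPC port.
    val trades = spark.read.parquet("alluxio://alluxio-master:19998/mnt/remote/trades")
    trades.groupBy("symbol").count().show()

    spark.stop()
  }
}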

Here are some blogs (Alluxio with Minio, Alluxio with HDFS, Alluxio with S3)
which use a similar architecture.

Hope that helps,
Gene


Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
As a matter of interest, what is the best way of creating virtualised
clusters all pointing to the same physical data?

thanks

Dr Mich Talebzadeh







Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
If it is mandatory, you can use a local cache like Alluxio.



Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
Thanks Vincent. I assume that by physical data locality you mean going
through Isilon and HCFS rather than through direct HDFS.

Also, I agree with you that the shared network could be an issue as well.
However, it allows you to reduce data redundancy (you do not need R3 in HDFS
anymore) and you can also build virtual clusters on the same data: one
cluster for read/writes and another for reads? That is what has been
suggested!
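
As a rough sketch of that split, assuming both virtual clusters can reach the
same shared filesystem URI (names and paths below are invented), the two jobs
would run on different clusters against one physical copy of the data; how
partly written data appears to the read-only side is exactly the kind of
thing a PoC should check:

// Sketch of the "two virtual clusters over one copy of the data" idea:
// both clusters point at the same shared location instead of keeping their
// own replicated copies.
import org.apache.spark.sql.SparkSession

object SharedDataSketch {
  val sharedPath = "hdfs://shared-storage:8020/warehouse/events"

  // Submitted on the read/write cluster: land new data into the shared path.
  def writerJob(spark: SparkSession): Unit = {
    spark.read.json("hdfs://shared-storage:8020/landing/events/2017-06-05")
      .write.mode("append").parquet(sharedPath)
  }

  // Submitted on the read-only cluster: query the same physical data,
  // with no second copy of it anywhere.
  def readerJob(spark: SparkSession): Unit = {
    spark.read.parquet(sharedPath)
      .groupBy("event_type").count().show()
  }
}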

regards

Dr Mich Talebzadeh








Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
I don't recommend this kind of design because you lose physical data
locality and you will be affected by "bad neighbours" that are also using the
network storage... We have one similar design, but restricted to small
clusters (more for experiments than production).



Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
Thanks Jörn,

This was a proposal made by someone, as the firm is already using this tool
on other SAN-based storage and wants to extend it to Big Data.

On paper it seems like a good idea; in practice it may be a Wandisco scenario
again. Of course, as ever, one needs to go to EMC for reference calls and ask
whether anyone is using this product in anger.

At the end of the day it's not HDFS. It is OneFS with an HCFS API. However,
that may suit our needs, but we would need to PoC it and test it thoroughly!


Cheers



Dr Mich Talebzadeh








Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Jörn Franke
Hi,

I have done this (not with Isilon, but with another storage system). It can
be efficient for small clusters, depending on how you design the network.

What I have also seen is the microservice approach with object stores (e.g.
S3 in the cloud, Swift on premises), which is somewhat similar.

If you want additional performance, you could fetch the data from the object
stores and store it temporarily in a local HDFS. I am not sure to what extent
this affects regulatory requirements, though.
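
As a rough sketch of that staging pattern (the bucket and paths are invented,
and the S3A connector plus credentials are assumed to be configured already):

// Sketch: stage a hot data set from an object store into local HDFS, then run
// the heavy queries against the local copy for better locality. Whether keeping
// this temporary copy is acceptable depends on the regulatory constraints
// discussed above.
import org.apache.spark.sql.SparkSession

object StageFromObjectStoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-from-object-store").getOrCreate()

    // One-off staging copy (hadoop distcp would do the same job outside Spark).
    spark.read.parquet("s3a://example-bucket/warehouse/trades/2017/05")
      .write.mode("overwrite").parquet("hdfs:///tmp/staging/trades/2017/05")

    // Subsequent jobs read the locally staged copy.
    val trades = spark.read.parquet("hdfs:///tmp/staging/trades/2017/05")
    trades.createOrReplaceTempView("trades")
    spark.sql("SELECT count(*) FROM trades").show()

    spark.stop()
  }
}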

Best regards

> On 31. May 2017, at 18:07, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I realize this may not have direct relevance to Spark, but has anyone tried to 
> create virtualized HDFS clusters using tools like Isilon or similar?
> 
> The prime motive behind this approach is to minimize the propagation or copying 
> of data, which has regulatory implications. In short, you want your data to be in 
> one place regardless of the artefacts used against it, such as Spark.
> 
> Thanks,
> 
> Dr Mich Talebzadeh
>  