Mich,

We used Isilon for a POC of Splice Machine (Spark for analytics, HBase for 
real-time). We had concerns initially, and the initial setup took a bit 
longer than expected, but it performed well on both low-latency and 
high-throughput use cases at scale (our POC was ~100 TB).

Just a data point.

Regards,
John Leach

> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> I am concerned about the use of tools like Isilon or Panasas to create a 
> layer on top of HDFS, essentially an HCFS in place of HDFS, with the usual 3x 
> replication handled by the tool itself.
> 
> There is interest in pushing Isilon forward as the solution, but my caution is 
> about the scalability and future-proofing of such tools. So I was wondering 
> whether anyone else has tried such a solution.
> 
> Thanks
>  
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
> 
> 
> On 2 June 2017 at 19:09, Gene Pang <gene.p...@gmail.com 
> <mailto:gene.p...@gmail.com>> wrote:
> As Vincent mentioned earlier, I think Alluxio can work for this. You can 
> mount your (potentially remote) storage systems to Alluxio 
> <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html>,
>  and deploy Alluxio co-located with the compute cluster. The computation 
> framework will still achieve data locality since Alluxio workers are 
> co-located, even though the existing storage systems may be remote. You can 
> also use tiered storage 
> <http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html> to 
> deploy using only memory, and/or other physical media.
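> For what it's worth, the mount Gene describes is a one-line operation in the
> Alluxio CLI. A minimal sketch, assuming an Alluxio 1.x install; the namenode
> host, bucket name, and local paths below are illustrative placeholders:
> 
> ```shell
> # Mount an existing HDFS namespace and an S3 bucket under the Alluxio namespace
> # (hostnames and bucket name are placeholders, not real endpoints).
> ./bin/alluxio fs mount /mnt/hdfs hdfs://namenode:8020/
> ./bin/alluxio fs mount /mnt/s3 s3a://my-bucket/data/
> 
> # conf/alluxio-site.properties -- two storage tiers: memory first, then SSD
> # alluxio.worker.tieredstore.levels=2
> # alluxio.worker.tieredstore.level0.alias=MEM
> # alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
> # alluxio.worker.tieredstore.level1.alias=SSD
> # alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
> ```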
> 
> Here are some blogs (Alluxio with Minio 
> <https://www.alluxio.com/blog/scalable-genomics-data-processing-pipeline-with-alluxio-mesos-and-minio>,
>  Alluxio with HDFS 
> <https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>,
>  Alluxio with S3 
> <https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>)
>  which use a similar architecture.
> 
> Hope that helps,
> Gene
> 
> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> As a matter of interest what is the best way of creating virtualised clusters 
> all pointing to the same physical data?
> 
> thanks
> 
> Dr Mich Talebzadeh
> 
> On 1 June 2017 at 09:27, vincent gromakowski <vincent.gromakow...@gmail.com 
> <mailto:vincent.gromakow...@gmail.com>> wrote:
> If mandatory, you can use a local cache like alluxio
> 
> On 1 Jun 2017 at 10:23 AM, "Mich Talebzadeh" <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> Thanks Vincent. I assume by loss of physical data locality you mean going 
> through Isilon and HCFS rather than through direct HDFS.
> 
> Also, I agree with you that the shared network could be an issue as well. 
> However, this approach allows you to reduce data redundancy (you no longer 
> need 3x replication in HDFS) and also to build virtual clusters on the same 
> data. One cluster for read/writes and another for reads? That is what has 
> been suggested!
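> As a sketch of that split: two Spark clusters can simply point at the same 
> shared HCFS endpoint, with writes confined to one of them by convention. The 
> master URLs, the SmartConnect hostname, and the job files below are all 
> hypothetical:
> 
> ```shell
> # Read/write cluster: ingest jobs write to the shared Isilon/HCFS namespace
> spark-submit --master spark://ingest-master:7077 \
>   --conf spark.hadoop.fs.defaultFS=hdfs://isilon-smartconnect:8020 \
>   ingest_job.py
> 
> # Read-only analytics cluster: separate compute, same physical data
> spark-submit --master spark://analytics-master:7077 \
>   --conf spark.hadoop.fs.defaultFS=hdfs://isilon-smartconnect:8020 \
>   analytics_job.py
> ```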
> 
> regards
> 
> Dr Mich Talebzadeh
> 
> On 1 June 2017 at 08:55, vincent gromakowski <vincent.gromakow...@gmail.com 
> <mailto:vincent.gromakow...@gmail.com>> wrote:
> I don't recommend this kind of design because you lose physical data 
> locality and you will be affected by "bad neighbors" that are also using the 
> network storage... We have one similar design, but restricted to small 
> clusters (more for experiments than production).
> 
> 2017-06-01 9:47 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>>:
> Thanks Jorn,
> 
> This was a proposal made by someone, as the firm is already using this tool on 
> other SAN-based storage and wants to extend it to Big Data.
> 
> On paper it seems like a good idea; in practice it may be a WANdisco scenario 
> again. Of course, as ever, one needs to ask EMC for reference calls and 
> whether anyone is using this product in anger.
>  
> At the end of the day it's not HDFS. It is OneFS with an HCFS API. However, 
> that may suit our needs. But we would need to PoC it and test it thoroughly!
> 
> Cheers
> 
> 
> Dr Mich Talebzadeh
> 
> On 1 June 2017 at 08:21, Jörn Franke <jornfra...@gmail.com 
> <mailto:jornfra...@gmail.com>> wrote:
> Hi,
> 
> I have done this (not with Isilon, but with another storage system). It can 
> be efficient for small clusters, depending on how you design the network.
> 
> What I have also seen is the microservice approach with object stores (e.g. 
> S3 in the cloud, Swift on premise), which is somewhat similar.
> 
> If you want additional performance you could fetch the data from the object 
> stores and store it temporarily in a local HDFS. Not sure to what extent this 
> affects regulatory requirements though.
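> A minimal sketch of that staging step, assuming the Hadoop CLI with an S3A 
> connector configured; the bucket and paths are placeholders:
> 
> ```shell
> # Stage a dataset from the object store into local HDFS for the job's lifetime
> hadoop distcp s3a://my-bucket/input/ hdfs:///tmp/staged-input/
> 
> # ... run the job against hdfs:///tmp/staged-input/ ...
> 
> # Remove the temporary copy afterwards (this matters if regulation requires
> # the data to live in exactly one place)
> hdfs dfs -rm -r -skipTrash /tmp/staged-input
> ```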
> 
> Best regards
> 
> On 31. May 2017, at 18:07, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> 
>> Hi,
>> 
>> I realize this may not have direct relevance to Spark, but has anyone tried 
>> to create virtualized HDFS clusters using tools like Isilon or similar?
>> 
>> The prime motive behind this approach is to minimize the propagation or 
>> copying of data, which has regulatory implications. In short, you want your 
>> data to be in one place regardless of the artefacts used against it, such 
>> as Spark.
>> 
>> Thanks,
>> 
>> Dr Mich Talebzadeh
