Thanks Jörn.

I have been told that Hadoop 3 (in alpha testing now) will support Docker
containers on YARN and virtualised Hadoop clusters.

Also, if we decided to use something like Isilon and BlueData to create
zoning (meaning two different Hadoop clusters migrated to Isilon storage,
each residing in its own zone/compartment) and virtualised clusters, we
would have to migrate two separate physical Hadoop clusters to Isilon and
then create that structure.

My point is that if we went that way, we would have to weigh up the cost
and effort of migrating two Hadoop clusters to Isilon, versus merging one
Hadoop cluster into the other to make one cluster out of two, while still
keeping the underlying HDFS file system. And then of course: how many
companies are going this way, and what is the overriding reason to use such
an approach? If we hit performance issues, where do we pinpoint the
bottleneck - Isilon or the third-party Hadoop vendor? There is really no
community to rely on either.

Your thoughts?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 June 2017 at 21:27, Jörn Franke <jornfra...@gmail.com> wrote:

> On HDFS you have storage policies where you can define SSD tiers etc.:
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
>
> Not sure if this is a similar offering to what you refer to.
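>
> A minimal sketch of applying such a policy from Scala (hypothetical paths
> and namenode host; assumes Hadoop 2.8+, where FileSystem itself exposes
> setStoragePolicy - on older releases you would cast to DistributedFileSystem):
>
> import java.net.URI
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
> // Keep hot data on SSD-backed volumes; let cold data sit on archive disks.
> fs.setStoragePolicy(new Path("/data/hot"), "ALL_SSD")
> fs.setStoragePolicy(new Path("/data/cold"), "COLD")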
>
> OpenStack Swift is similar to S3, but for your own data center:
> https://docs.openstack.org/developer/swift/associated_projects.html
>
> On 15. Jun 2017, at 21:55, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> In Isilon etc. you have an SSD tier, a middle layer and an archive layer
> where data is moved. Can that be implemented in HDFS itself, Jörn? What is
> Swift? Is that a low-level archive disk?
>
> thanks
>
>
> On 15 June 2017 at 20:42, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Well, this also happens if you use Amazon EMR - most data will be stored
>> on S3 and there you likewise have no data locality. You can move it
>> temporarily to HDFS or in-memory (Ignite), and you can use sampling etc.
>> to avoid the need to process all the data (a sketch follows below). In
>> fact, that is done in Spark machine learning algorithms (stochastic
>> gradient descent etc.). This avoids moving all the data through the
>> network, and you lose only a little precision (and you can reason about
>> it statistically).
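>>
>> A minimal sketch of that sampling idea (hypothetical DataFrame name df;
>> the 1% fraction is arbitrary; assumes a spark-shell style session):
>>
>> // Process a random 1% sample instead of shipping the whole dataset over the network.
>> val sample = df.sample(withReplacement = false, fraction = 0.01, seed = 42)
>> // Estimates computed on the sample can be scaled back up, e.g. an approximate row count:
>> val approxRows = sample.count() * 100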
>> For a lot of data I also see the trend that companies move it to cheap
>> object storage (Swift etc.) anyway to reduce cost - particularly because
>> it is not accessed often.
>>
>>
>> On 15. Jun 2017, at 21:34, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Thanks Jörn.
>>
>> If the idea is to separate compute from storage using Isilon etc., then
>> one is going to lose data locality.
>>
>> Also, the argument is that we would like to run queries/reports against
>> two independent clusters simultaneously, so the plan would be to:
>>
>>
>>    1. Use Isilon OneFS
>>    <https://en.wikipedia.org/wiki/OneFS_distributed_file_system> for Big
>>    Data, migrating the two independent Hadoop clusters into Isilon OneFS
>>    2. Locate each cluster's data in its own zone in Isilon
>>    3. Run queries that combine data from the two zones
>>    4. Use BlueData
>>    <https://www.bluedata.com/blog/2016/10/next-generation-big-data-with-dell-and-emc/>
>>    to create virtual Hadoop clusters on top of Isilon, so that the
>>    performance impact of analytics/Data Science is isolated from other users
>>
>>
>> Now that is easier said than done, as usual. First you have to migrate the
>> two existing clusters' data into zones in Isilon. Then you are effectively
>> separating compute from storage, so data locality is lost. This is no
>> different from your Spark cluster accessing data from each cluster. There
>> are a lot of tangential arguments here, such as: Isilon uses RAID, so you
>> don't need to replicate your data three ways (R3). Even including the
>> Isilon licensing cost, the total cost goes down!
>>
>> The side effect is the network, now that you have lost data locality: how
>> fast does your network need to be to handle the throughput? Networks are
>> shared across, say, a bank unless you spend $$$ creating private
>> InfiniBand networks. Standard 10 Gbit/s is not going to be good enough -
>> 10 Gbit/s is roughly 1.25 GB/s, so scanning even 10 TB over a single such
>> link takes more than two hours, before any contention.
>>
>> Also, in reality BlueData does not need Isilon; it runs on HP and other
>> hardware as well. In Apache Hadoop 3.0 the Docker engine on YARN is
>> available (currently alpha, to be released at the end of this year). As we
>> have not started on Isilon, it may be worth looking at this too?
>>
>> Cheers
>>
>>
>> On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> It does not matter to Spark; you just put the full HDFS URL of the
>>> namenode there, as in the sketch below. Of course the issue is that you
>>> lose data locality, but this would also be the case for Oracle.
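>>>
>>> A minimal sketch of that (hypothetical host names and paths; assumes a
>>> spark-shell style session where spark is in scope):
>>>
>>> // Fully qualified URIs let one Spark job read from either cluster's namenode.
>>> val dfA = spark.read.parquet("hdfs://namenode-a:8020/data/tableA")
>>> val dfB = spark.read.parquet("hdfs://namenode-b:8020/data/tableB")
>>> // The join pulls blocks from both clusters over the network to the executors.
>>> val joined = dfA.join(dfB, Seq("id"))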
>>>
>>> On 15. Jun 2017, at 18:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> With Spark, how easy is it to fetch data from two different clusters and
>>> do a join?
>>>
>>> I can use two JDBC connections to join two tables from two different
>>> Oracle instances in Spark, by creating two DataFrames and joining them
>>> together - along the lines of the sketch below.
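>>>
>>> A minimal sketch of that two-JDBC-connection approach (hypothetical
>>> hosts, service names, tables and credentials; assumes the Oracle JDBC
>>> driver is on the classpath and a spark-shell style session):
>>>
>>> val df1 = spark.read.format("jdbc")
>>>   .option("url", "jdbc:oracle:thin:@db1:1521/ORCL1")
>>>   .option("dbtable", "schema1.table1")
>>>   .option("user", "user1").option("password", "pw1")
>>>   .load()
>>> val df2 = spark.read.format("jdbc")
>>>   .option("url", "jdbc:oracle:thin:@db2:1521/ORCL2")
>>>   .option("dbtable", "schema2.table2")
>>>   .option("user", "user2").option("password", "pw2")
>>>   .load()
>>> // Join the two DataFrames inside Spark, independently of either Oracle instance.
>>> val joined = df1.join(df2, Seq("id"))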
>>>
>>> Would that be possible for data residing on two different HDFS clusters?
>>>
>>> thanks
>>>
>>>
>>
>
