I would not recommend running two similar systems (Kafka and Kinesis) back
to back, as it would be a waste of servers. If you are worried about network
connectivity to Kinesis, then you should equally worry about Kafka going
down because of corrupted partitions or some other failure. It is better to
put some buffering logic (spilling to a file, for example) in the component
that injects data into these queues (the source/producer reading your
extract files), and then choose either Kafka or Kinesis.
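
As a rough sketch of the buffering idea above (not a definitive
implementation): the producer tries to send each record and, if the queue is
unreachable, appends it to a local spill file and replays it later. Everything
named here is an assumption made for illustration: the SpillingProducer class,
the `send` callable (which would wrap whichever Kafka or Kinesis client you
pick), and the JSON-lines spill file.

```python
import json
import os
from typing import Callable


class SpillingProducer:
    """Hypothetical wrapper: spills records to a local file when sending fails."""

    def __init__(self, send: Callable[[dict], None], spill_path: str):
        self._send = send              # assumed to raise an exception on failure
        self._spill_path = spill_path  # local JSON-lines file used as the buffer

    def _has_spilled(self) -> bool:
        return (os.path.exists(self._spill_path)
                and os.path.getsize(self._spill_path) > 0)

    def _spill(self, record: dict) -> None:
        # Append one JSON document per line and force it to disk.
        with open(self._spill_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def publish(self, record: dict) -> bool:
        """Send a record; return False if it was spilled to disk instead."""
        if self._has_spilled():
            # A backlog exists: queue behind it so ordering is preserved.
            self._spill(record)
            return False
        try:
            self._send(record)
            return True
        except Exception:
            self._spill(record)
            return False

    def replay(self) -> int:
        """Resend spilled records in order; return how many were sent."""
        if not self._has_spilled():
            return 0
        with open(self._spill_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        sent = 0
        try:
            for record in pending:
                self._send(record)
                sent += 1
        except Exception:
            pass  # queue still unavailable: stop here and keep the remainder
        # Rewrite the spill file atomically with whatever was not sent.
        tmp_path = self._spill_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for record in pending[sent:]:
                f.write(json.dumps(record) + "\n")
        os.replace(tmp_path, self._spill_path)
        return sent
```

You would call publish() for every record read from the extract files and
call replay() periodically, or once connectivity is back. Note that this only
gives at-least-once delivery: if the process dies between a successful send
and the rewrite of the spill file, some records will be sent again, so the
consumer side should tolerate duplicates.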

You can consider using AWS Direct Connect to connect your on-prem cluster
to the AWS cloud with HA connectivity instead of going over the public
internet.

Is the staging area needed only during maintenance of the production EDW? I
think for staging you can go with plain files (in a data format optimised
for Spark) on HDFS rather than deploying Phoenix, unless you are planning to
access the data using SQL for some other purpose.

Regards,
Ankit Singhal

On Tue, May 2, 2017 at 7:55 PM, Josh Elser <els...@apache.org> wrote:

> Planning for unexpected outages with HBase is a very good idea. At a
> minimum, there will likely be points in time where you want to change HBase
> configuration, apply some patched jars, etc. A staging area that can buffer
> data for later processing and avoid dropping data on the floor makes this
> process much easier.
>
> Apache Kafka is just one tool that can help with implementing such a
> staging area -- Apache NiFi is another you might want to look at. I'll
> avoid making any suggestions as to how you should do it because I don't
> know your requirements (nor really care to *wink*). There are lots of tools
> here, you'll need to do the research for your requirements and needs to
> evaluate what tools would work best.
>
> Ash N wrote:
>
>>
>> Hello,
>>
>> We are building an Enterprise Data Warehouse (EDW) on Phoenix (HBase).
>> Please refer to the attached diagram.
>>
>> The EDW supports a unified architecture that serves both streaming and
>> batch use cases.
>>
>> I am recommending a staging area that is source compliant (i.e. that
>> mimics source structure)
>> In the EDW path - data is always loaded into staging and then gets moved
>> to EDW.
>>
>> Folks are not liking the idea due to an additional hop. They are saying
>> the hop is unnecessary and will cause latency issues.
>>
>> I am saying latency can be handled in two ways:
>>
>> 1. The caching layer will take care of it
>> 2. If designed properly, latency is a function of hardware
>>
>> What are your thoughts?
>>
>> One other question - is Kafka required at all?
>> It was introduced in the architecture so that we can replay messages in
>> case of Kinesis connectivity issues.
>> Is there a better way to do it?
>>
>> help as always is appreciated.
>>
>>
>> Inline image 1
>>
>>
>>
>> thanks,
>> -ash
