For fairly simple transformations, Flume is great, and works fine subscribing
to some pretty high volumes of messages from Kafka (I think we hit 50M/second
at one point). If you need to do complex transformations, e.g. database
lookups for the Kafka-to-Hadoop ETL, then you will start having
Kafka is capable of processing billions of events per second, and you can
scale it horizontally by adding Kafka broker servers.
You can try out these steps:
1. Create a topic in Kafka to receive all your data. You will have to use a
Kafka producer to ingest data into Kafka (see the sketch below).
2. If you are going to write your own HDFS
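A minimal sketch of the producer side of step 1; the topic name
"transactions", the broker address, and the sample record are assumptions for
illustration, not part of the original steps:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionIngest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for all in-sync replicas to acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each transactional record goes to the ingestion topic;
            // close() flushes any pending sends.
            producer.send(new ProducerRecord<>("transactions", "txn-001",
                    "{\"amount\": 100}"));
        }
    }
}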
Thanks! What about Kafka with Flume? I would also like to mention that the
everyday data intake is in the millions, and we can't afford to lose even a
single piece of data, which creates a need for high availability.
Warm Regards
Sidharth Kumar
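On the durability point: if losing even a single record is unacceptable, the
ingestion topic itself should be replicated across brokers. A sketch using
Kafka's AdminClient; the topic name, partition count, and broker address are
assumptions for illustration:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: the topic survives the
            // loss of up to 2 brokers without losing messages.
            NewTopic topic = new NewTopic("transactions", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

Combined with acks=all on the producer side, this gives end-to-end durability
for the ingested stream.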
The ideal sequence should be:
1. Ingress using Kafka -> validation and processing using Spark -> write
into any NoSQL DB or Hive (a sketch of this pipeline follows below).
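As a sketch of that sequence with Spark Structured Streaming, assuming the
spark-sql-kafka connector is on the classpath; the topic name, paths, and the
validation rule are illustrative assumptions, not a definitive implementation:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ValidatePipeline {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-validate-hive")
                .getOrCreate();

        // Read the raw transactions from Kafka as a stream.
        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "transactions")
                .load();

        // Validation step (placeholder rule): keep only non-empty payloads.
        Dataset<Row> valid = raw.selectExpr("CAST(value AS STRING) AS payload")
                .filter("payload IS NOT NULL AND length(payload) > 0");

        // Write the validated stream out, e.g. as Parquet that Hive can read.
        valid.writeStream()
                .format("parquet")
                .option("path", "/warehouse/validated_transactions")
                .option("checkpointLocation", "/tmp/checkpoints/validated_transactions")
                .start()
                .awaitTermination();
    }
}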
From my recent experience, writing directly to HDFS can be slow depending on
the data format.
Thanks
JP
From: Sudeep Singh Thakur [mailto:sudeepth
In your use case, Kafka would be better because you want some transformations
and validations.
Kind regards,
Sudeep Singh Thakur
On Jun 30, 2017 8:57 AM, "Sidharth Kumar"
wrote:
> Hi,
>
> I have a requirement where all transactional data is ingested into
> Hadoop in real time and, before being stored
Hi,
I have a requirement where all transactional data is ingested into Hadoop in
real time and, before being stored into Hadoop, is processed to validate it.
If the data fails the validation process, it will not be stored into Hadoop.
The validation process also makes use of histo
Hi Omprakash!
If both datanodes die at the same time, then yes, data will be lost. In
that case, you should increase dfs.replication to 3 (so that there will be
3 copies). This obviously adversely affects the total amount of data you
can store on HDFS.
However, if only 1 datanode dies, the namenod
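For reference, a sketch of raising the factor; note that dfs.replication only
sets the default for newly written files, so existing files must be
re-replicated explicitly, either per file as below or with
hdfs dfs -setrep -w 3 /path from the shell. The file path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default for files written from now on
        FileSystem fs = FileSystem.get(conf);

        // Existing files keep their old factor until it is changed explicitly;
        // setReplication applies per file (this path is hypothetical).
        fs.setReplication(new Path("/data/transactions.csv"), (short) 3);
        fs.close();
    }
}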
On 29 June 2017 at 17:20, omprakash wrote:
> Hi Sidharth,
>
> Thanks a lot for the clarification. Could you suggest parameters that can
> improve the re-replication in case of failure?
>
> Regards
> Om
>
> *From:* Sidharth Kumar [mailto:sidharthkumar2...@gmail.com]
> *Sen
Hi Sidharth,
Thanks a lot for the clarification. Could you suggest parameters that can
improve the re-replication in case of failure?
Regards
Om
From: Sidharth Kumar [mailto:sidharthkumar2...@gmail.com]
Sent: 29 June 2017 16:06
To: omprakash
Cc: Arpit Agarwal ; common-u...@hadoop.apa
Hi,
No, as no copy of that file will exist. You can increase the replication
factor to 3 so that 3 copies are created; then even if 2 datanodes go down,
you will still have one copy available, which will again be replicated to 3
by the namenode in due course of time.
Warm Regards
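A quick way to verify which factor actually applied is to read it back per
file, as in the sketch below (the directory path is a placeholder);
hdfs fsck / will likewise report any under-replicated blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Print the replication factor of each file under the given directory.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            if (status.isFile()) {
                System.out.println(status.getPath() + " -> replication "
                        + status.getReplication());
            }
        }
        fs.close();
    }
}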
Hi Ravi,
I have 5 nodes in my Hadoop cluster and all have the same configuration.
After setting dfs.replication=2, I did a clean start of HDFS.
As per your suggestion, I added 2 more datanodes and cleaned all the data and
metadata. The performance of the cluster has dramatically improved. I can