Self-healing hdfs functionality

2018-02-02 Thread sidharth kumar
Hi,

I would like to compile a list of issues we generally face in HDFS and discuss how we can
make it a self-healing distributed file system. I know HDFS already has some self-healing
functionality, such as re-replication of under-replicated blocks, but there are still many
issues a Hadoop administrator faces in day-to-day work that could be automated as
self-healing features.

So I would like to request all community members to share the issues they face every day
that could be taken up as features for a self-healing HDFS.

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599
LinkedIn: www.linkedin.com/in/sidharthkumar2792
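
As a concrete starting point for such a list, here are a couple of checks an administrator
typically runs by hand today and that are natural candidates for automation (a minimal
sketch; the grep patterns are only examples):

# Summarize overall HDFS health, including under-replicated, corrupt and missing blocks
$ hdfs fsck / | grep -iE 'under.replicated|corrupt|missing'

# Cluster-wide capacity and dead-datanode overview
$ hdfs dfsadmin -report | grep -iE 'dead|under replicated|capacity'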





Hbase trace

2018-01-24 Thread sidharth kumar
Hi Team,

I want to know what read and write request operations are being carried out on HBase. I
enabled TRACE in log4j but could not get this info. Could you please help me understand how
to extract this information from HBase, and which log would give me better insight?

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599
LinkedIn: www.linkedin.com/in/sidharthkumar2792
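
One approach that may help, though it is only an assumption about your setup: read and write
requests are served by the RegionServers, so their logs are usually the right place to look,
and the RPC server's log level can be raised there. A sketch for HBase 1.x package names
(the log path is hypothetical, and TRACE is very verbose):

# Append to conf/log4j.properties on the RegionServers, then restart them
log4j.logger.org.apache.hadoop.hbase.ipc.RpcServer=TRACE

# Watch the RegionServer logs for the request details (path is an example)
$ tail -f /var/log/hbase/hbase-*-regionserver-*.log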





Local read-only users in ambari

2017-12-01 Thread sidharth kumar
Hi Team,

I am trying to create a read-only user in Ambari. I created a group and gave it the
"Cluster User" permission, but when I try to create a folder, or create or drop a table in
Hive, the operations are still executed. In one blog I read that we should install
ambari-server on a remote server, register the cluster as a remote cluster and set up Hive
Views, but I still have the same problem. Kindly help me resolve this.


Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599
LinkedIn: www.linkedin.com/in/sidharthkumar2792





Apache ambari

2017-09-08 Thread sidharth kumar
Hi,




Apache Ambari is open source. So, can we set up Apache Ambari to manage an existing
Apache Hadoop cluster?




Warm Regards




Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367 


LinkedIn:www.linkedin.com/in/sidharthkumar2792











spark on yarn error -- Please help

2017-08-28 Thread sidharth kumar
Hi,

I have configured Apache Spark over YARN. I am able to run MapReduce jobs
successfully, but spark-shell gives the error below.


Kindly help me to resolve this issue.




*SPARK-DEFAULT.CONF*

spark.master                                  spark://master2:7077
spark.eventLog.enabled                        true
spark.eventLog.dir                            hdfs://ha-cluster/user/spark/ApplicationHistory
spark.shuffle.service.enabled                 true
spark.shuffle.sevice.port                     7337
spark.yarn.historyServer.address              http://master2:18088
spark.yarn.archive                            hdfs://jio-cluster/user/spark/share/spark-archive.zip
spark.master                                  yarn
spark.dynamicAllocaltion.enabled              true
spark.dynamicAllocaltion.executorIdleTimeout  60
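
As a side note on the configuration above, spark.master appears twice (spark://master2:7077
and yarn) and a couple of property names look misspelled (spark.shuffle.sevice.port,
spark.dynamicAllocaltion.*), which Spark will not recognize. A quick way to take the config
file out of the equation and to see why the YARN application actually failed (the
application id is the one spark-shell prints):

# Launch the shell with an explicit master to avoid the conflicting spark.master entries
$ spark-shell --master yarn --deploy-mode client

# Inspect the failed YARN application's container logs (replace the id with yours)
$ yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX | less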





*YARN-SITE.XML*

<configuration>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master1:8025</value>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master1:8040</value>
  </property>

  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master1:8032</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

</configuration>


*ERROR:*

17/08/28 14:47:05 ERROR cluster.YarnClientSchedulerBackend: Yarn
application has already exited with state FAILED!

17/08/28 14:47:05 ERROR spark.SparkContext: Error initializing SparkContext.

java.lang.IllegalStateException: Spark context stopped while waiting for
backend

at
org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:673)

at
org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:186)

at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)

at
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)

at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)

at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)

at scala.Option.getOrElse(Option.scala:121)

at
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)

at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)

at $line3.$read$$iw$$iw.<init>(<console>:15)

at $line3.$read$$iw.<init>(<console>:42)

at $line3.$read.<init>(<console>:44)

at $line3.$read$.<init>(<console>:48)

at $line3.$read$.<clinit>(<console>)

at $line3.$eval$.$print$lzycompute(<console>:7)

17/08/28 14:47:05 ERROR client.TransportClient: Failed to send RPC
5084109606506612903 to / slave3:55375:
java.nio.channels.ClosedChannelException

java.nio.channels.ClosedChannelException

at
io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

17/08/28 14:47:05 ERROR cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful

java.io.IOException: Failed to send RPC 5084109606506612903 to /
slave3:55375: java.nio.channels.ClosedChannelException

at
org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)

at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)

at
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)

at
io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)

at
io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)

at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)

at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)

at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)

at
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.nio.channels.ClosedChannelException

at
io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

17/08/28 14:47:05 ERROR util.Utils: Uncaught exception in thread Yarn
application state monitor

org.apache.spark.SparkException: Exception thrown in awaitResult:

at
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)

at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)

at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:108)

Caused by: java.io.IOException: Failed to send RPC 5084109606506612903 to
/slave3:55375: java.nio.channels.ClosedChannelException

at
org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)

at

Hadoop 3.0

2017-07-09 Thread sidharth kumar
Hi,




Is there any documentation through which we can know what changes are
targeted in Hadoop 3.0?




Warm Regards




Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367 


LinkedIn:www.linkedin.com/in/sidharthkumar2792











Re: reconfiguring storage

2017-07-07 Thread sidharth kumar
Hi,




Just want to add on to daemeon's point: if the misconfiguration happened on a couple of
nodes, it's better to do it one node at a time, or else take a backup of your data first.




Warm Regards




Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367 


LinkedIn:www.linkedin.com/in/sidharthkumar2792













From: daemeon reiydelle


Sent: Thursday, 6 July, 9:59 PM


Subject: Re: reconfiguring storage


To: Brian Jeltema


Cc: user






Another option is to stop the node's relevant Hadoop services (including e.g.
Spark, Impala, etc. if applicable), move the existing local storage aside, mount the
desired file system, and move the data over. Then just restart Hadoop. As long
as this does not take too long and there is no write-consistency requirement forcing
that shard to be written in the meantime, you will be fine.
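
A rough sketch of that procedure for a single DataNode, assuming Hadoop 2.x helper scripts;
the device and directory names are purely illustrative, use whatever dfs.datanode.data.dir
points to:

# On the affected worker node
$ hadoop-daemon.sh stop datanode            # plus yarn-daemon.sh stop nodemanager, etc.
$ mv /data/dfs/dn /data/dfs/dn.old          # set aside the existing local storage
$ mount /dev/sdb1 /data/dfs/dn              # mount the intended file system
$ cp -a /data/dfs/dn.old/. /data/dfs/dn/    # move the block data over
$ hadoop-daemon.sh start datanode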










Daemeon C.M. Reiydelle


USA (+1) 415.501.0198


London (+44) (0) 20 8144 9872






On Thu, Jul 6, 2017 at 9:17 AM, Brian Jeltema <bdjelt...@gmail.com> wrote:




I recently discovered that I made a mistake setting up some cluster nodes and 
didn’t


attach storage to some mount points for HDFS. To fix this, I presume I should 
decommission


the relevant nodes, fix the mounts, then recommission the nodes.




My question is, when the nodes are recommissioned, will the HDFS storage


automatically be reset to ‘empty’, or do I need to perform some sort of explicit


initialization on those volumes before returning the nodes to active status.
















Re: Kafka or Flume

2017-07-02 Thread Sidharth Kumar
Thank you very much for your help. What about the flow NiFi --> Kafka
--> Storm for real-time processing, and then storing into HBase?

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792







On 02-Jul-2017 12:40 PM, "Gagan Brahmi" <gaganbra...@gmail.com> wrote:

NiFi can do that job as well. While using NiFi to ingest data from the
source, you can apply validation and direct the flow of the data.

Even if you need to combine the incoming flow with another source of data
it is possible using NiFi. It is an intelligent tool to have while
designing any kind of data flow.


Regards,
Gagan Brahmi

On Sat, Jul 1, 2017 at 8:42 PM, Sidharth Kumar <sidharthkumar2...@gmail.com>
wrote:

> Great, thanks! It's a great tool but you mentioned
>
> For ingestion
>
> NiFi -> Kafka
>
> For data verification
>
> Kafka -> NiFi -> HDFS/Hive/HBase
>
> Whereas I have to apply validation while ingesting data and then route
> them based on validation output. This validation will make use of history
> data stored in hadoop.
>
> So can you suggest a flow with a little more in detail
>
>
> Warm Regards
>
> Sidharth Kumar | Mob: +91 8197 555 599 <+91%2081975%2055599>/7892 192 367
> |  LinkedIn:www.linkedin.com/in/sidharthkumar2792
>
>
>
>
>
>
> On 01-Jul-2017 9:46 PM, "Gagan Brahmi" <gaganbra...@gmail.com> wrote:
>
> I'd say the data flow should be simpler since you might need some basic
> verification of the data. You may want to include NiFi in the mix which
> should do the job.
>
> It can look something like this:
>
>
> Regards,
> Gagan Brahmi
>
> On Sat, Jul 1, 2017 at 7:26 AM, Sidharth Kumar <
> sidharthkumar2...@gmail.com> wrote:
>
>> Thanks for your suggestions. I feel kafka will be better but need some
>> extra like either kafka with flume or kafka with spark streaming. Can you
>> kindly suggest which will be better and in which situation which
>> combination will perform best.
>>
>> Thanks in advance for your help.
>>
>> Warm Regards
>>
>> Sidharth Kumar | Mob: +91 8197 555 599 <+91%2081975%2055599>/7892 192
>> 367 |  LinkedIn:www.linkedin.com/in/sidharthkumar2792
>>
>>
>>
>>
>>
>>
>> On 30-Jun-2017 11:18 AM, "daemeon reiydelle" <daeme...@gmail.com> wrote:
>>
>>> For fairly simple transformations, Flume is great, and works fine
>>> subscribing
>>> ​to some pretty ​
>>> high volumes of messages from Kafka
>>> ​ (I think we hit 50M/second at one point)​
>>> . If you need to do complex transformations, e.g. database lookups for
>>> the Kafka to Hadoop ETL, then you will start having complexity issues which
>>> will exceed the capability of Flume.
>>> ​There are git repos that have everything you need, which include the
>>> kafka adapter, hdfs writer, etc. A lot of this is built into flume. ​
>>> I assume this might be a bit off topic, so googling flume & kafka will
>>> help you?
>>>
>>> On Thu, Jun 29, 2017 at 10:14 PM, Mallanagouda Patil <
>>> mallanagouda.c.pa...@gmail.com> wrote:
>>>
>>>> Kafka is capable of processing billions of events per second. You can
>>>> scale it horizontally with Kafka broker servers.
>>>>
>>>> You can try out these steps
>>>>
>>>> 1. Create a topic in Kafka to get your all data. You have to use Kafka
>>>> producer to ingest data into Kafka.
>>>> 2. If you are going to write your own HDFS client to put data into HDFS
>>>> then, you can read data from topic in step-1, validate and store into HDFS.
>>>> 3. If you want to OpenSource tool (Gobbling or confluent Kafka HDFS
>>>> connector) to put data into HDFS then
>>>> Write tool to read data from topic, validate and store in other topic.
>>>>
>>>> We are using combination of these steps to process over 10 million
>>>> events/second.
>>>>
>>>> I hope it helps..
>>>>
>>>> Thanks
>>>> Mallan
>>>>
>>>> On Jun 30, 2017 10:31 AM, "Sidharth Kumar" <sidharthkumar2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks! What about Kafka with Flume? And also I would like to tell
>>>>> that everyday data intake is in millions and can't afford to loose even a
>>>>> single piece of data. Which makes a need of  high availablity.
>>>>>
>>&g

Re: Kafka or Flume

2017-07-01 Thread Sidharth Kumar
Great, thanks! It's a great tool, but you mentioned:

For ingestion

NiFi -> Kafka

For data verification

Kafka -> NiFi -> HDFS/Hive/HBase

Whereas I have to apply validation while ingesting the data and then route it
based on the validation output. This validation will make use of historical data
stored in Hadoop.

So can you suggest a flow in a little more detail?


Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792






On 01-Jul-2017 9:46 PM, "Gagan Brahmi" <gaganbra...@gmail.com> wrote:

I'd say the data flow should be simpler since you might need some basic
verification of the data. You may want to include NiFi in the mix which
should do the job.

It can look something like this:


Regards,
Gagan Brahmi

On Sat, Jul 1, 2017 at 7:26 AM, Sidharth Kumar <sidharthkumar2...@gmail.com>
wrote:

> Thanks for your suggestions. I feel kafka will be better but need some
> extra like either kafka with flume or kafka with spark streaming. Can you
> kindly suggest which will be better and in which situation which
> combination will perform best.
>
> Thanks in advance for your help.
>
> Warm Regards
>
> Sidharth Kumar | Mob: +91 8197 555 599 <+91%2081975%2055599>/7892 192 367
> |  LinkedIn:www.linkedin.com/in/sidharthkumar2792
>
>
>
>
>
>
> On 30-Jun-2017 11:18 AM, "daemeon reiydelle" <daeme...@gmail.com> wrote:
>
>> For fairly simple transformations, Flume is great, and works fine
>> subscribing
>> ​to some pretty ​
>> high volumes of messages from Kafka
>> ​ (I think we hit 50M/second at one point)​
>> . If you need to do complex transformations, e.g. database lookups for
>> the Kafka to Hadoop ETL, then you will start having complexity issues which
>> will exceed the capability of Flume.
>> ​There are git repos that have everything you need, which include the
>> kafka adapter, hdfs writer, etc. A lot of this is built into flume. ​
>> I assume this might be a bit off topic, so googling flume & kafka will
>> help you?
>>
>> On Thu, Jun 29, 2017 at 10:14 PM, Mallanagouda Patil <
>> mallanagouda.c.pa...@gmail.com> wrote:
>>
>>> Kafka is capable of processing billions of events per second. You can
>>> scale it horizontally with Kafka broker servers.
>>>
>>> You can try out these steps
>>>
>>> 1. Create a topic in Kafka to get your all data. You have to use Kafka
>>> producer to ingest data into Kafka.
>>> 2. If you are going to write your own HDFS client to put data into HDFS
>>> then, you can read data from topic in step-1, validate and store into HDFS.
>>> 3. If you want to OpenSource tool (Gobbling or confluent Kafka HDFS
>>> connector) to put data into HDFS then
>>> Write tool to read data from topic, validate and store in other topic.
>>>
>>> We are using combination of these steps to process over 10 million
>>> events/second.
>>>
>>> I hope it helps..
>>>
>>> Thanks
>>> Mallan
>>>
>>> On Jun 30, 2017 10:31 AM, "Sidharth Kumar" <sidharthkumar2...@gmail.com>
>>> wrote:
>>>
>>>> Thanks! What about Kafka with Flume? And also I would like to tell that
>>>> everyday data intake is in millions and can't afford to loose even a single
>>>> piece of data. Which makes a need of  high availablity.
>>>>
>>>> Warm Regards
>>>>
>>>> Sidharth Kumar | Mob: +91 8197 555 599 <+91%2081975%2055599>/7892 192
>>>> 367 |  LinkedIn:www.linkedin.com/in/sidharthkumar2792
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 30-Jun-2017 10:04 AM, "JP gupta" <jp.gu...@altruistindia.com> wrote:
>>>>
>>>>> The ideal sequence should be:
>>>>>
>>>>> 1.  Ingress using Kafka -> Validation and processing using Spark
>>>>> -> Write into any NoSql DB or Hive.
>>>>>
>>>>> From my recent experience, writing directly to HDFS can be slow
>>>>> depending on the data format.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> JP
>>>>>
>>>>>
>>>>>
>>>>> *From:* Sudeep Singh Thakur [mailto:sudeepthaku...@gmail.com]
>>>>> *Sent:* 30 June 2017 09:26
>>>>> *To:* Sidharth Kumar
>>>>> *Cc:* Maggy; common-u...@hadoop.apache.org
>>>>> *Subj

Re: Kafka or Flume

2017-07-01 Thread Sidharth Kumar
Thanks for your suggestions. I feel Kafka will be better, but it needs something
extra, like either Kafka with Flume or Kafka with Spark Streaming. Can you
kindly suggest which will be better, and in which situation each
combination will perform best.

Thanks in advance for your help.

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792






On 30-Jun-2017 11:18 AM, "daemeon reiydelle" <daeme...@gmail.com> wrote:

> For fairly simple transformations, Flume is great, and works fine
> subscribing
> ​to some pretty ​
> high volumes of messages from Kafka
> ​ (I think we hit 50M/second at one point)​
> . If you need to do complex transformations, e.g. database lookups for the
> Kafka to Hadoop ETL, then you will start having complexity issues which
> will exceed the capability of Flume.
> ​There are git repos that have everything you need, which include the
> kafka adapter, hdfs writer, etc. A lot of this is built into flume. ​
> I assume this might be a bit off topic, so googling flume & kafka will
> help you?
>
> On Thu, Jun 29, 2017 at 10:14 PM, Mallanagouda Patil <
> mallanagouda.c.pa...@gmail.com> wrote:
>
>> Kafka is capable of processing billions of events per second. You can
>> scale it horizontally with Kafka broker servers.
>>
>> You can try out these steps
>>
>> 1. Create a topic in Kafka to get your all data. You have to use Kafka
>> producer to ingest data into Kafka.
>> 2. If you are going to write your own HDFS client to put data into HDFS
>> then, you can read data from topic in step-1, validate and store into HDFS.
>> 3. If you want to OpenSource tool (Gobbling or confluent Kafka HDFS
>> connector) to put data into HDFS then
>> Write tool to read data from topic, validate and store in other topic.
>>
>> We are using combination of these steps to process over 10 million
>> events/second.
>>
>> I hope it helps..
>>
>> Thanks
>> Mallan
>>
>> On Jun 30, 2017 10:31 AM, "Sidharth Kumar" <sidharthkumar2...@gmail.com>
>> wrote:
>>
>>> Thanks! What about Kafka with Flume? And also I would like to tell that
>>> everyday data intake is in millions and can't afford to loose even a single
>>> piece of data. Which makes a need of  high availablity.
>>>
>>> Warm Regards
>>>
>>> Sidharth Kumar | Mob: +91 8197 555 599 <+91%2081975%2055599>/7892 192
>>> 367 |  LinkedIn:www.linkedin.com/in/sidharthkumar2792
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 30-Jun-2017 10:04 AM, "JP gupta" <jp.gu...@altruistindia.com> wrote:
>>>
>>>> The ideal sequence should be:
>>>>
>>>> 1.  Ingress using Kafka -> Validation and processing using Spark
>>>> -> Write into any NoSql DB or Hive.
>>>>
>>>> From my recent experience, writing directly to HDFS can be slow
>>>> depending on the data format.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> JP
>>>>
>>>>
>>>>
>>>> *From:* Sudeep Singh Thakur [mailto:sudeepthaku...@gmail.com]
>>>> *Sent:* 30 June 2017 09:26
>>>> *To:* Sidharth Kumar
>>>> *Cc:* Maggy; common-u...@hadoop.apache.org
>>>> *Subject:* Re: Kafka or Flume
>>>>
>>>>
>>>>
>>>> In your use Kafka would be better because you want some transformations
>>>> and validations.
>>>>
>>>> Kind regards,
>>>> Sudeep Singh Thakur
>>>>
>>>>
>>>>
>>>> On Jun 30, 2017 8:57 AM, "Sidharth Kumar" <sidharthkumar2...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I have a requirement where I have all transactional data injestion into
>>>> hadoop in real time and before storing the data into hadoop, process it to
>>>> validate the data. If the data failed to pass validation process , it will
>>>> not be stored into hadoop. The validation process also make use of
>>>> historical data which is stored in hadoop. So, my question is which
>>>> injestion tool will be best for this Kafka or Flume?
>>>>
>>>>
>>>>
>>>> Any suggestions will be a great help for me.
>>>>
>>>>
>>>> Warm Regards
>>>>
>>>> Sidharth Kumar | Mob: +91 8197 555 599 <+91%2081975%2055599>/7892 192
>>>> 367 |  LinkedIn:www.linkedin.com/in/sidharthkumar2792
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>


RE: Kafka or Flume

2017-06-29 Thread Sidharth Kumar
Thanks! What about Kafka with Flume? I would also like to mention that the daily
data intake is in the millions, and we can't afford to lose even a single piece
of data, which creates a need for high availability.

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792






On 30-Jun-2017 10:04 AM, "JP gupta" <jp.gu...@altruistindia.com> wrote:

> The ideal sequence should be:
>
> 1.  Ingress using Kafka -> Validation and processing using Spark ->
> Write into any NoSql DB or Hive.
>
> From my recent experience, writing directly to HDFS can be slow depending
> on the data format.
>
>
>
> Thanks
>
> JP
>
>
>
> *From:* Sudeep Singh Thakur [mailto:sudeepthaku...@gmail.com]
> *Sent:* 30 June 2017 09:26
> *To:* Sidharth Kumar
> *Cc:* Maggy; common-u...@hadoop.apache.org
> *Subject:* Re: Kafka or Flume
>
>
>
> In your use Kafka would be better because you want some transformations
> and validations.
>
> Kind regards,
> Sudeep Singh Thakur
>
>
>
> On Jun 30, 2017 8:57 AM, "Sidharth Kumar" <sidharthkumar2...@gmail.com>
> wrote:
>
> Hi,
>
>
>
> I have a requirement where I have all transactional data injestion into
> hadoop in real time and before storing the data into hadoop, process it to
> validate the data. If the data failed to pass validation process , it will
> not be stored into hadoop. The validation process also make use of
> historical data which is stored in hadoop. So, my question is which
> injestion tool will be best for this Kafka or Flume?
>
>
>
> Any suggestions will be a great help for me.
>
>
> Warm Regards
>
> Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
> www.linkedin.com/in/sidharthkumar2792
>
>
>
>
>
>


Kafka or Flume

2017-06-29 Thread Sidharth Kumar
Hi,

I have a requirement where all transactional data is ingested into Hadoop in
real time, and before storing the data in Hadoop, it must be processed to
validate it. If the data fails the validation process, it will not be stored in
Hadoop. The validation process also makes use of historical data already stored
in Hadoop. So my question is: which ingestion tool will be best for this,
Kafka or Flume?

Any suggestions will be a great help for me.


Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792
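
If Kafka is chosen, the ingest-validate-route requirement described above is commonly
modeled with two topics, one for raw records and one for records that passed validation;
the topic names, ZooKeeper address and sizing below are only illustrative (Kafka 0.10/1.x
era CLI):

# Raw transactional feed written by the producers
$ kafka-topics.sh --create --zookeeper zk1:2181 --topic raw-transactions \
    --partitions 6 --replication-factor 3

# Records that passed validation, consumed by the HDFS/HBase writer
$ kafka-topics.sh --create --zookeeper zk1:2181 --topic validated-transactions \
    --partitions 6 --replication-factor 3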


RE: Lots of warning messages and exception in namenode logs

2017-06-29 Thread Sidharth Kumar
Hi,

No, as no copy of that file will exist. You can increase the replication factor
to 3 so that three copies are created; even if 2 datanodes go down you will
still have one copy available, which the namenode will re-replicate back to 3
in due course of time.


Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792
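
A minimal sketch of that change, assuming the new factor should also apply to everything
already in HDFS (the -w flag waits for the re-replication to finish):

# Raise the default for new files in hdfs-site.xml (dfs.replication = 3), then fix existing files:
$ hdfs dfs -setrep -w 3 /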






On 29-Jun-2017 3:45 PM, "omprakash" <ompraka...@cdac.in> wrote:

> Hi Ravi,
>
>
>
> I have 5 nodes in Hadoop cluster and all have same configurations. After
> setting *dfs.replication=2 *, I did a clean start of hdfs.
>
>
>
> As per your suggestion, I added 2 more datanodes and clean all the data
> and metadata. The performance of the cluster has dramatically improved. I
> can see through logs that the files are randomly replicated to four
> datanodes (2 replica of each file).
>
>
>
> But here my problem arise. I want redundant datanodes such that if any two
> of the datanodes goes down I still be able to get files from other two. In
> above case suppose file block-xyz get stored on datanode1 and datanode2,
> and some day these two datanodes goes down , will I be able to access the
> block-xyz? This is what I am worried about.
>
>
>
>
>
> Regards
>
> Om
>
>
>
>
>
> *From:* Ravi Prakash [mailto:ravihad...@gmail.com]
> *Sent:* 27 June 2017 22:36
> *To:* omprakash <ompraka...@cdac.in>
> *Cc:* Arpit Agarwal <aagar...@hortonworks.com>; user <
> user@hadoop.apache.org>
> *Subject:* Re: Lots of warning messages and exception in namenode logs
>
>
>
> Hi Omprakash!
>
> This is *not* ok. Please go through the datanode logs of the inactive
> datanode and figure out why its inactive. If you set dfs.replication to 2,
> atleast as many datanodes (and ideally a LOT more datanodes) should be
> active and participating in the cluster.
>
> Do you have the hdfs-site.xml you posted to the mailing list on all the
> nodes (including the Namenode)? Was the file containing block
> *blk_1074074104_337394* created when you had the cluster misconfigured to
> dfs.replication=3 ? You can determine which file the block belongs to using
> this command:
>
> hdfs fsck -blockId blk_1074074104
>
> Once you have the file, you can set its replication using
> hdfs dfs -setrep 2 
>
> I'm guessing that you probably have a lot of files with this replication,
> in which case you should set it on / (This would overwrite the replication
> on all the files)
>
>
>
> If the data on this cluster is important I would be very worried about the
> condition its in.
>
> HTH
>
> Ravi
>
>
>
> On Mon, Jun 26, 2017 at 11:22 PM, omprakash <ompraka...@cdac.in> wrote:
>
> Hi all,
>
>
>
> I started the HDFS in DEBUG mode. After examining the logs I found below
> logs which read that the replication factor required is 3 (as against the
> specified *dfs.replication=2*).
>
>
>
> *DEBUG BlockStateChange: BLOCK* NameSystem.UnderReplicationBlock.add:
> blk_1074074104_337394 has only 1 replicas and need 3 replicas so is added
> to neededReplications at priority level 0*
>
>
>
> *P.S : I have 1 datanode active out of 2. *
>
>
>
> I can also see from Namenode UI that the no. of under replicated blocks
> are growing.
>
>
>
> Any idea? Or this is OK.
>
>
>
> regards
>
>
>
>
>
> *From:* omprakash [mailto:ompraka...@cdac.in]
> *Sent:* 23 June 2017 11:02
> *To:* 'Ravi Prakash' <ravihad...@gmail.com>; 'Arpit Agarwal' <
> aagar...@hortonworks.com>
> *Cc:* 'user' <user@hadoop.apache.org>
> *Subject:* RE: Lots of warning messages and exception in namenode logs
>
>
>
> Hi Arpit,
>
>
>
> I will enable the settings as suggested and will post the results.
>
>
>
> I am just curious about setting *Namenode RPC service  port*. As I have
> checked the *hdfs-site.xml* properties, *dfs.namenode.rpc-address* is
> already set which will be default value to RPC service port also. Does
> specifying any other port have advantage over default one?
>
>
>
> Regarding JvmPauseMonitor Error, there are 5-6 instances of this error in 
> namenode logs. Here is one of them.
>
>
>
> How to identify the size of heap In such cases as I have 4GB of RAM on the
> namenode VM.?
>
>
>
> *@Ravi* Since the file size are very small thus I have only configured a
> VM with 20 GB space. The additional disk is simple SATA disk not SSD.
>
>
>
> As I can see from Namenode UI there are more than 50% of block under
> replicated. I have now 400K blocks out of which 200K are under-replicated.
>
> I will po

Re: GARBAGE COLLECTOR

2017-06-19 Thread Sidharth Kumar
I am running a MapReduce job which runs fine (without Out of Memory errors) with
8 GB of map and reduce memory, but it does have long GC pauses. Currently I am
using the default JVM collector, and G1 is recommended when the JVM heap grows
beyond 4 GB. I want to know if switching will resolve the issue without
impacting any other services or non-MapReduce applications.
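
One way to apply that split without touching the daemons is to pass G1 only to the task
JVMs of the long-running job, for example (a sketch; the job jar, driver class and heap
sizes are placeholders, and the -D generic options assume the driver uses ToolRunner):

# Per-job override: run the map/reduce tasks with G1, leave daemon JVM options untouched
$ hadoop jar my-job.jar MyDriver \
    -Dmapreduce.map.java.opts="-Xmx6g -XX:+UseG1GC" \
    -Dmapreduce.reduce.java.opts="-Xmx6g -XX:+UseG1GC" \
    -Dmapreduce.map.memory.mb=8192 -Dmapreduce.reduce.memory.mb=8192 \
    input/ output/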

On Mon, Jun 19, 2017 at 6:04 PM, Harsh J <ha...@cloudera.com> wrote:

> You can certainly configure it this way without any ill effects, but note
> that MR job tasks are typically short lived and GC isn't really a big issue
> for most of what it does.
>
> On Mon, 19 Jun 2017 at 14:20 Sidharth Kumar <sidharthkumar2...@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> How feasible will it be, if I configure CMS Garbage collector for Hadoop
>> daemons and configure G1 for Map Reduce jobs which run for hours?
>>
>> Thanks for your help ...!
>>
>>
>> --
>> Regards
>> Sidharth Kumar | Mob: +91 8197 555 599 <081975%2055599> | LinkedIn
>> <https://www.linkedin.com/in/sidharthkumar2792/>
>>
>


-- 
Regards
Sidharth Kumar | Mob: +91 8197 555 599 | LinkedIn
<https://www.linkedin.com/in/sidharthkumar2792/>


GARBAGE COLLECTOR

2017-06-19 Thread Sidharth Kumar
Hi Team,

How feasible will it be, if I configure CMS Garbage collector for Hadoop
daemons and configure G1 for Map Reduce jobs which run for hours?

Thanks for your help ...!

-- 
Regards
Sidharth Kumar | Mob: +91 8197 555 599 | LinkedIn
<https://www.linkedin.com/in/sidharthkumar2792/>


Re: How to monitor YARN application memory per container?

2017-06-13 Thread Sidharth Kumar
Hi,

I guess you can get it from http://<host>:<port>/jmx or
/metrics

Regards
Sidharth
LinkedIn: www.linkedin.com/in/sidharthkumar2792
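
A small sketch of both routes mentioned here, assuming the default NodeManager web port
8042; the second command turns the "Memory usage of ProcessTree" lines quoted in the
question below into rough CSV (timestamp, container id, physical memory), and the log file
name is only an example:

# JMX endpoint of a NodeManager
$ curl -s http://nm-host:8042/jmx | less

# Crude per-container memory samples extracted from the NodeManager log
$ grep 'Memory usage of ProcessTree' yarn-nodemanager.log \
    | sed -E 's/^([0-9-]+ [0-9:,]+).*for container-id ([^:]+): ([^;]+);.*/\1,\2,\3/'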

On 13-Jun-2017 6:26 PM, "Shmuel Blitz"  wrote:

> (This question has also been published on StackOveflow
> )
>
> I am looking for a way to monitor memory usage of YARN containers over
> time.
>
> Specifically - given a YARN application-id, how can you get a graph,
> showing the memory usage of each of its containers over time?
>
> The main goal is to better fit memory allocation requirements for our YARN
> applications (Spark / Map-Reduce), to avoid over allocation and cluster
> resource waste. A side goal would be the ability to debug memory issues
> when developing our jobs and attempting to pick reasonable resource
> allocations.
>
> We've tried using the Data-Dog integration, But it doesn't break down the
> metrics by container.
>
> Another approach was to parse the hadoop-yarn logs. These logs have
> messages like:
>
> Memory usage of ProcessTree 57251 for container-id
> container_e116_1495951495692_35134_01_01: 1.9 GB of 11 GB physical
> memory used; 14.4 GB of 23.1 GB virtual memory used
> Parsing the logs correctly can yield data that can be used to plot a graph
> of memory usage over time.
>
> That's exactly what we want, but there are two downsides:
>
> It involves reading human-readable log lines and parsing them into numeric
> data. We'd love to avoid that.
> If this data can be consumed otherwise, we're hoping it'll have more
> information that we might be interest in in the future. We wouldn't want to
> put the time into parsing the logs just to realize we need something else.
> Is there any other way to extract these metrics, either by plugging in to
> an existing producer or by writing a simple listener?
>
> Perhaps a whole other approach?
>
> --
> Shmuel Blitz
> *Big Data Developer*
> www.similarweb.com
>


Re: When i run wordcount of Hadoop in Win10, i got wrong info

2017-06-12 Thread Sidharth Kumar
Check /tmp directory permissions and owners

Sidharth


On 13-Jun-2017 3:20 AM, "Deng Yong"  wrote:

> D:\hdp\sbin>yarn jar d:/hdp/share/hadoop/mapreduce/
> hadoop-mapreduce-examples-2.7.3.jar wordcount /aa.txt /out
>
> 17/06/10 15:27:32 INFO client.RMProxy: Connecting to ResourceManager at /
> 0.0.0.0:8032
>
> 17/06/10 15:27:33 INFO input.FileInputFormat: Total input paths to process
> : 1
>
> 17/06/10 15:27:33 INFO mapreduce.JobSubmitter: number of splits:1
>
> 17/06/10 15:27:33 INFO mapreduce.JobSubmitter: Submitting tokens for job:
> job_1497079386122_0001
>
> 17/06/10 15:27:34 INFO impl.YarnClientImpl: Submitted application
> application_1497079386122_0001
>
> 17/06/10 15:27:34 INFO mapreduce.Job: The url to track the job:
> http://DESKTOP-6LF1EE1:8088/proxy/application_1497079386122_0001/
>
> 17/06/10 15:27:34 INFO mapreduce.Job: Running job: job_1497079386122_0001
>
> 17/06/10 15:27:38 INFO mapreduce.Job: Job job_1497079386122_0001 running
> in uber mode : false
>
> 17/06/10 15:27:38 INFO mapreduce.Job:  map 0% reduce 0%
>
> *17/06/10 15:27:38 INFO mapreduce.Job: Job job_1497079386122_0001 failed
> with state FAILED due to: Application application_1497079386122_0001 failed
> 2 times due to AM Container for appattempt_1497079386122_0001_02 exited
> with  exitCode: -1000*
>
> *For more detailed output, check application tracking
> page:http://DESKTOP-6LF1EE1:8088/cluster/app/application_1497079386122_0001Then
> ,
> click on links to logs of each attempt.*
>
> Diagnostics: Failed to setup local dir /tmp/hadoop-d00338403/nm-local-dir,
> which was marked as good.
>
> Failing this attempt. Failing the application.
>
> 17/06/10 15:27:38 INFO mapreduce.Job: Counters: 0
>
> v
>
> Sent from the Mail app for Windows 10
>
>
>


How to Contribute as hadoop admin

2017-05-31 Thread Sidharth Kumar
Hi,

I have been working as a Hadoop admin for 2 years. I subscribed to this group
3 months ago, but since then I have not been able to figure out what a Hadoop
admin can contribute. It would be great if someone could help me find a way to
contribute to Hadoop 3.0 development.

Thanks for your help in advance

Sidharth
Mob: +91 819799
LinkedIn: www.linkedin.com/in/sidharthkumar2792


Re: Why hdfs don't have current working directory

2017-05-26 Thread Sidharth Kumar
Thanks, I'll check it out.


Sidharth
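
A quick illustration of the point in the reply quoted below, assuming a user named
sidharth whose HDFS home directory is /user/sidharth:

$ hdfs dfs -mkdir -p /user/sidharth          # home directory, created once (as a superuser if needed)
$ hdfs dfs -put notes.txt foo                # relative path ...
$ hdfs dfs -cat foo                          # ... refers to the same file as
$ hdfs dfs -cat /user/sidharth/foo           # ... this absolute path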

On 26-May-2017 4:10 PM, "Hariharan" <hariharan...@gmail.com> wrote:

> The concept of working directory is only useful for processes, and HDFS
> does not have executables. I guess what you're looking for is absolute vs
> relative paths (so that you can do something like hdfs cat foo instead of
> hdfs cat /user/me/foo). HDFS does have this to a limited extent - if your
> path is not absolute, it is relative from your home directory (or root if
> there is no home directory for your user).
>
> Thanks,
> Hariharan
>
> On Fri, May 26, 2017 at 3:44 PM, Sidharth Kumar <
> sidharthkumar2...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you kindly explain me why hdfs doesnt have current directory concept.
>> Why Hadoop is not implement to use pwd? Why command like cd and PWD cannot
>> be implemented in hdfs?
>>
>> Regards
>> Sidharth
>> Mob: +91 819799
>> LinkedIn: www.linkedin.com/in/sidharthkumar2792
>>
>
>


Why hdfs don't have current working directory

2017-05-26 Thread Sidharth Kumar
Hi,

Can you kindly explain to me why HDFS doesn't have the concept of a current
directory? Why is Hadoop not implemented to use pwd? Why can't commands like cd
and pwd be implemented in HDFS?

Regards
Sidharth
Mob: +91 819799
LinkedIn: www.linkedin.com/in/sidharthkumar2792


Re: access error while trying to run distcp from source cluster

2017-05-25 Thread Sidharth Kumar
Hi ,

It may be because the user doesn't have write permission on the destination
cluster path.

For example:
$ su - abcde
$ hadoop distcp /data/sample1 hdfs://destclstnn:8020/data/

So, in the above case, user abcde should have write permission on the
destination path hdfs://destclstnn:8020/data/
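
To check and, if necessary, fix that on the destination cluster (a sketch; run the
chown/chmod as a user with the required rights there):

# Inspect current ownership/permissions of the target directory
$ hadoop fs -ls -d hdfs://destclstnn:8020/data/

# Grant the copying user write access, either by ownership or by mode
$ hadoop fs -chown abcde hdfs://destclstnn:8020/data/
$ hadoop fs -chmod 775 hdfs://destclstnn:8020/data/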


Regards
Sidharth



On 25-May-2017 10:25 PM, "nancy henry"  wrote:

Hi Team,

I am trying to copy data from cluster A to cluster B, with the same user for both.

I am running the distcp command on source cluster A,

but I am getting the error below.

17/05/25 07:24:08 INFO mapreduce.Job: Running job: job_1492549627402_344485
17/05/25 07:24:17 INFO mapreduce.Job: Job job_1492549627402_344485 running in uber mode : false
17/05/25 07:24:17 INFO mapreduce.Job:  map 0% reduce 0%
17/05/25 07:24:26 INFO mapreduce.Job: Task Id : attempt_1492549627402_344485_m_00_0, Status : FAILED
Error: org.apache.hadoop.security.AccessControlException: User abcde (user id 50006054) has been denied access to create distcptest2
        at com.mapr.fs.MapRFileSystem.makeDir(MapRFileSystem.java:1282)
        at com.mapr.fs.MapRFileSystem.mkdirs(MapRFileSystem.java:1302)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1913)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:272)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:51)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

This is the error.


Re: Block pool error in datanode

2017-05-24 Thread Sidharth Kumar
So I guess this is due to a change in the block pool ID. If you have an older
fsimage backup, start the namenode using that fsimage; otherwise, delete the
current directory of the datanodes' HDFS storage and re-format the namenode once again.
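
If you take the second route, the steps look roughly like this (a destructive sketch: it
wipes the existing blocks, so only use it on a cluster whose data you can afford to lose;
the data directory below is an example, use whatever dfs.datanode.data.dir points to):

# On every DataNode
$ hadoop-daemon.sh stop datanode
$ rm -rf /data/dfs/dn/current               # removes the old block pool

# On the NameNode
$ hdfs namenode -format
$ start-dfs.sh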

Regards
Sidharth
Mob: +91 819799
LinkedIn: www.linkedin.com/in/sidharthkumar2792

On 23-May-2017 7:52 PM, "Dhanushka Parakrama" 
wrote:

> Hi All
>
> I have 5 node cluster  where  2 name node are runs in active and standby
> mode . when i try to add the data nodes after name node format i get the
> below error . is there any way to fix it
>
>
>
>
>
> Error
> 
> 2017-05-23 14:10:29,805 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> Initialization failed for Block pool 
> BP-1938612011-192.168.56.148-1495541823828
> (Datanode Uuid 5c652b07-e0cb-4917-9435-f534a680dbad) service to
> nn2.cluster.com/192.168.56.149:9000 Blockpool ID mismatch: previously
> connected to Blockpool ID BP-1938612011-192.168.56.148-1495541823828 but
> now connected to Blockpool ID BP-769166748-192.168.56.148-1494926203928
> 2017-05-23 14:10:34,808 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> Initialization failed for Block pool 
> BP-1938612011-192.168.56.148-1495541823828
> (Datanode Uuid 5c652b07-e0cb-4917-9435-f534a680dbad) service to
> nn2.cluster.com/192.168.56.149:9000 Blockpool ID mismatch: previously
> connected to Blockpool ID BP-1938612011-192.168.56.148-1495541823828 but
> now connected to Blockpool ID BP-769166748-192.168.56.148-1494926203928
>
>
>
>
>
>


RE: Hdfs default block size

2017-05-22 Thread Sidharth Kumar
Thank you for your help.
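
To confirm on a running cluster (the default in 2.7.3 is 134217728 bytes, i.e. 128 MB):

$ hdfs getconf -confKey dfs.blocksize
134217728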


On 22-May-2017 5:23 PM, "surendra lilhore" <surendra.lilh...@huawei.com>
wrote:

> Hi Sidharth,
>
>
>
> It is 128MB.
>
>
>
> You can refer  this link https://hadoop.apache.org/
> docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>
>
>
> Regards
>
> Surendra
>
>
>
>
>
> *From:* Sidharth Kumar [mailto:sidharthkumar2...@gmail.com]
> *Sent:* 22 May 2017 19:36
> *To:* common-u...@hadoop.apache.org
> *Subject:* Hdfs default block size
>
>
>
> Hi,
>
>
>
> Can you kindly tell me what is the default block size in apache hadoop
> 2.7.3? Is it 64mb or 128mb?
>
>
> Thanks
>
>
>
> Sidharth
>


Hdfs default block size

2017-05-22 Thread Sidharth Kumar
Hi,

Can you kindly tell me what the default block size is in Apache Hadoop
2.7.3? Is it 64 MB or 128 MB?


Thanks

Sidharth


Re: Hadoop 2.7.3 cluster namenode not starting

2017-05-17 Thread Sidharth Kumar
Hi,

The error you mentioned below, "Name or service not known", means the hostname
cannot be resolved. Check your network and /etc/hosts configuration.

Sidharth
Mob: +91 819799
LinkedIn: www.linkedin.com/in/sidharthkumar2792
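
A few quick checks on the node reporting the error (the hostname "master" and port 51150
are taken from this thread):

$ hostname -i                       # what this host thinks its own address is
$ getent hosts master               # does 'master' resolve at all?
$ ping -c 1 master                  # basic reachability
$ sudo netstat -tulpn | grep 51150  # is anything already bound to the NameNode port?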

On 17-May-2017 12:13 PM, "Bhushan Pathak" <bhushan.patha...@gmail.com>
wrote:

Apologies for the delayed reply, was away due to some personal issues.

I tried the telnet command as well, but no luck. I get the response that
'Name or service not known'

Thanks
Bhushan Pathak

Thanks
Bhushan Pathak

On Wed, May 3, 2017 at 7:48 AM, Sidharth Kumar <sidharthkumar2...@gmail.com>
wrote:

> Can you check if the ports are opened by running telnet command.
> Run below command from source machine to destination machine and check if
> this help
>
> $ telnet <destination-ip> <port>
> Ex: $ telnet 192.168.1.60 9000
>
>
> Let's Hadooping!
>
> Bests
> Sidharth
> Mob: +91 819799
> LinkedIn: www.linkedin.com/in/sidharthkumar2792
>
> On 28-Apr-2017 10:32 AM, "Bhushan Pathak" <bhushan.patha...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> 1. The slave & master can ping each other as well as use passwordless SSH
>> 2. The actual IP starts with 10.x.x.x, I have put in the config file as I
>> cannot share  the actual IP
>> 3. The namenode is formatted. I executed the 'hdfs namenode -format'
>> again just to rule out the possibility
>> 4. I did not configure anything in the master file. I don;t think Hadoop
>> 2.7.3 has a master file to be configured
>> 5. The netstat command [sudo netstat -tulpn | grep '51150' ] does not
>> give any output.
>>
>> Even if I change  the port number to a different one, say 52220, 5, I
>> still get the same error.
>>
>> Thanks
>> Bhushan Pathak
>>
>> Thanks
>> Bhushan Pathak
>>
>> On Fri, Apr 28, 2017 at 7:52 AM, Lei Cao <charlie.c...@hotmail.com>
>> wrote:
>>
>>> Hi Mr. Bhushan,
>>>
>>> Have you tried to format namenode?
>>> Here's the command:
>>> hdfs namenode -format
>>>
>>> I've encountered such problem as namenode cannot be started. This
>>> command line easily fixed my problem.
>>>
>>> Hope this can help you.
>>>
>>> Sincerely,
>>> Lei Cao
>>>
>>>
>>> On Apr 27, 2017, at 12:09, Brahma Reddy Battula <
>>> brahmareddy.batt...@huawei.com> wrote:
>>>
>>> *Please check “hostname –i” .*
>>>
>>>
>>>
>>>
>>>
>>> *1)  **What’s configured in the “master” file.(you shared only
>>> slave file).?*
>>>
>>>
>>>
>>> *2)  **Can you able to “ping master”?*
>>>
>>>
>>>
>>> *3)  **Can you configure like this check once..?*
>>>
>>> *1.1.1.1 master*
>>>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>> Brahma Reddy Battula
>>>
>>>
>>>
>>> *From:* Bhushan Pathak [mailto:bhushan.patha...@gmail.com
>>> <bhushan.patha...@gmail.com>]
>>> *Sent:* 27 April 2017 18:16
>>> *To:* Brahma Reddy Battula
>>> *Cc:* user@hadoop.apache.org
>>> *Subject:* Re: Hadoop 2.7.3 cluster namenode not starting
>>>
>>>
>>>
>>> Some additional info -
>>>
>>> OS: CentOS 7
>>>
>>> RAM: 8GB
>>>
>>>
>>>
>>> Thanks
>>>
>>> Bhushan Pathak
>>>
>>>
>>> Thanks
>>>
>>> Bhushan Pathak
>>>
>>>
>>>
>>> On Thu, Apr 27, 2017 at 3:34 PM, Bhushan Pathak <
>>> bhushan.patha...@gmail.com> wrote:
>>>
>>> Yes, I'm running the command on the master node.
>>>
>>>
>>>
>>> Attached are the config files & the hosts file. I have updated the IP
>>> address only as per company policy, so that original IP addresses are not
>>> shared.
>>>
>>>
>>>
>>> The same config files & hosts file exist on all 3 nodes.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Bhushan Pathak
>>>
>>>
>>> Thanks
>>>
>>> Bhushan Pathak
>>>
>>>
>>>
>>> On Thu, Apr 27, 2017 at 3:02 PM, Brahma Reddy Battula <
>>> brahmareddy.batt...@huawei.com> wrote:
>>>
>>> Are you sure that you are starting in same machine (mast

Re: Hadoop 2.7.3 cluster namenode not starting

2017-05-02 Thread Sidharth Kumar
Can you check if the ports are opened by running telnet command.
Run below command from source machine to destination machine and check if
this help

$ telnet <destination-ip> <port>
Ex: $ telnet 192.168.1.60 9000


Let's Hadooping!

Bests
Sidharth
Mob: +91 819799
LinkedIn: www.linkedin.com/in/sidharthkumar2792

On 28-Apr-2017 10:32 AM, "Bhushan Pathak" 
wrote:

> Hello All,
>
> 1. The slave & master can ping each other as well as use passwordless SSH
> 2. The actual IP starts with 10.x.x.x, I have put in the config file as I
> cannot share  the actual IP
> 3. The namenode is formatted. I executed the 'hdfs namenode -format' again
> just to rule out the possibility
> 4. I did not configure anything in the master file. I don;t think Hadoop
> 2.7.3 has a master file to be configured
> 5. The netstat command [sudo netstat -tulpn | grep '51150' ] does not
> give any output.
>
> Even if I change  the port number to a different one, say 52220, 5, I
> still get the same error.
>
> Thanks
> Bhushan Pathak
>
> Thanks
> Bhushan Pathak
>
> On Fri, Apr 28, 2017 at 7:52 AM, Lei Cao  wrote:
>
>> Hi Mr. Bhushan,
>>
>> Have you tried to format namenode?
>> Here's the command:
>> hdfs namenode -format
>>
>> I've encountered such problem as namenode cannot be started. This command
>> line easily fixed my problem.
>>
>> Hope this can help you.
>>
>> Sincerely,
>> Lei Cao
>>
>>
>> On Apr 27, 2017, at 12:09, Brahma Reddy Battula <
>> brahmareddy.batt...@huawei.com> wrote:
>>
>> *Please check “hostname –i” .*
>>
>>
>>
>>
>>
>> *1)  **What’s configured in the “master” file.(you shared only slave
>> file).?*
>>
>>
>>
>> *2)  **Can you able to “ping master”?*
>>
>>
>>
>> *3)  **Can you configure like this check once..?*
>>
>> *1.1.1.1 master*
>>
>>
>>
>>
>>
>> Regards
>>
>> Brahma Reddy Battula
>>
>>
>>
>> *From:* Bhushan Pathak [mailto:bhushan.patha...@gmail.com
>> ]
>> *Sent:* 27 April 2017 18:16
>> *To:* Brahma Reddy Battula
>> *Cc:* user@hadoop.apache.org
>> *Subject:* Re: Hadoop 2.7.3 cluster namenode not starting
>>
>>
>>
>> Some additional info -
>>
>> OS: CentOS 7
>>
>> RAM: 8GB
>>
>>
>>
>> Thanks
>>
>> Bhushan Pathak
>>
>>
>> Thanks
>>
>> Bhushan Pathak
>>
>>
>>
>> On Thu, Apr 27, 2017 at 3:34 PM, Bhushan Pathak <
>> bhushan.patha...@gmail.com> wrote:
>>
>> Yes, I'm running the command on the master node.
>>
>>
>>
>> Attached are the config files & the hosts file. I have updated the IP
>> address only as per company policy, so that original IP addresses are not
>> shared.
>>
>>
>>
>> The same config files & hosts file exist on all 3 nodes.
>>
>>
>>
>> Thanks
>>
>> Bhushan Pathak
>>
>>
>> Thanks
>>
>> Bhushan Pathak
>>
>>
>>
>> On Thu, Apr 27, 2017 at 3:02 PM, Brahma Reddy Battula <
>> brahmareddy.batt...@huawei.com> wrote:
>>
>> Are you sure that you are starting in same machine (master)..?
>>
>>
>>
>> Please share “/etc/hosts” and configuration files..
>>
>>
>>
>>
>>
>> Regards
>>
>> Brahma Reddy Battula
>>
>>
>>
>> *From:* Bhushan Pathak [mailto:bhushan.patha...@gmail.com]
>> *Sent:* 27 April 2017 17:18
>> *To:* user@hadoop.apache.org
>> *Subject:* Fwd: Hadoop 2.7.3 cluster namenode not starting
>>
>>
>>
>> Hello
>>
>>
>>
>> I have a 3-node cluster where I have installed hadoop 2.7.3. I have
>> updated core-site.xml, mapred-site.xml, slaves, hdfs-site.xml,
>> yarn-site.xml, hadoop-env.sh files with basic settings on all 3 nodes.
>>
>>
>>
>> When I execute start-dfs.sh on the master node, the namenode does not
>> start. The logs contain the following error -
>>
>> 2017-04-27 14:17:57,166 ERROR 
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> Failed to start namenode.
>>
>> java.net.BindException: Problem binding to [master:51150]
>> java.net.BindException: Cannot assign requested address; For more details
>> see:  http://wiki.apache.org/hadoop/BindException
>>
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>> ConstructorAccessorImpl.java:62)
>>
>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>> legatingConstructorAccessorImpl.java:45)
>>
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>> 23)
>>
>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.
>> java:792)
>>
>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:
>> 721)
>>
>> at org.apache.hadoop.ipc.Server.bind(Server.java:425)
>>
>> at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:574)
>>
>> at org.apache.hadoop.ipc.Server.<init>(Server.java:2215)
>>
>> at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:951)
>>
>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(
>> ProtobufRpcEngine.java:534)
>>
>> at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRp
>> cEngine.java:509)
>>
>> at 

Re: Hdfs read and write operation

2017-04-20 Thread Sidharth Kumar
Hi,

Could anyone kindly help me clear up my doubts below?

Thanks

On 19-Apr-2017 8:08 PM, "Sidharth Kumar" <sidharthkumar2...@gmail.com>
wrote:

Hi,

Please help me to understand:
1) If we read the anatomy of an HDFS file write in Hadoop: The Definitive Guide, it says
the data queue is consumed by the DataStreamer. So, can you tell me whether there is only
one streamer which consumes packets from the data queue and creates a pipeline for each
packet to store it on the datanodes, or whether there are multiple streamers which consume
packets from the data queue and store them on the datanodes in parallel?
2) Multiple blogs have been written claiming that read and write are parallel processes
(below I have pasted one such link). Can you also help me by clarifying whether they are wrong?
http://stackoverflow.com/questions/30400249/hadoop-pipeline-write-and-parallel-read

Bests
Sidharth
LinkedIn: www.linkedin.com/in/sidharthkumar2792


Re: Hadoop namespace format user and permissions

2017-04-20 Thread Sidharth Kumar
Hi James,

Please create a hadoop group and an hdfs user, and change the ownership of the directory
to hdfs:hadoop. HDFS typically runs as the hdfs user. This should probably resolve your
issue. If you need it, I can share a document which I made for pseudo-distributed
installation to help my colleagues.

Please let me know if the issue still persists.

Let's hadooping.!

Best regards
Sidharth
Mob: +91 819799
LinkedIn: www.linkedin.com/in/sidharthkumar2792
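
A minimal sketch of those steps for the directories shown in the quoted mail below (the
hdfs user and hadoop group are conventional names; adjust to your environment):

$ sudo groupadd hadoop
$ sudo useradd -g hadoop hdfs
$ sudo chown -R hdfs:hadoop /res0/hdfs
$ sudo -u hdfs /usr/local/hadoop/bin/hdfs namenode -format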

On 20-Apr-2017 9:21 AM, "Gabriel James"  wrote:

> Hi All,
>
>
>
> First time installation, not an expert, think probably a permissions
> issue, not sure best way to resolve, forgive basic questions.
>
>
>
> Trying to setup Hadoop 2.7.3 as single node on Ubuntu 16.04 with Java Open
> JDK 8. Then will setup a three node cluster.
>
>
>
>1. I think the directory paths are correct in hdfs-site.xml
>2. Not sure what user namenode uses?
>3. Not sure what permissions namenode needs?
>4. Not sure what user and group configuration for running as service?
>
>
>
> The error below seems to common on the web, but couldn’t find a clear
> answer, the Hadoop documentation seems quite minimal.
>
>
>
> */usr/local/hadoop/bin/hdfs namenode -format*
>
>
>
> 17/04/20 03:27:34 WARN namenode.NameNode: Encountered exception during
> format:
>
> java.io.IOException: Cannot create directory /res0/hdfs/name/current
>
> at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.
> clearDirectory(Storage.java:337)
>
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(
> NNStorage.java:564)
>
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(
> NNStorage.java:585)
>
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(
> FSImage.java:161)
>
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(
> NameNode.java:992)
>
> at org.apache.hadoop.hdfs.server.namenode.NameNode.
> createNameNode(NameNode.java:1434)
>
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(
> NameNode.java:1559)
>
> 17/04/20 03:27:34 ERROR namenode.NameNode: Failed to start namenode.
>
> java.io.IOException: Cannot create directory /res0/hdfs/name/current
>
> at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.
> clearDirectory(Storage.java:337)
>
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(
> NNStorage.java:564)
>
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(
> NNStorage.java:585)
>
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(
> FSImage.java:161)
>
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(
> NameNode.java:992)
>
> at org.apache.hadoop.hdfs.server.namenode.NameNode.
> createNameNode(NameNode.java:1434)
>
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(
> NameNode.java:1559)
>
> 17/04/20 03:27:34 INFO util.ExitUtil: Exiting with status 1
>
> 17/04/20 03:27:34 INFO namenode.NameNode: SHUTDOWN_MSG:
>
>
>
> */usr/local/hadoop/etc/hadoop/Hadoop-env.sh*
>
>
>
> export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
>
>
>
> */usr/local/hadoop/etc/hadoop/hdfs-site.xml*
>
>
>
> 
>
> 
>
>
>
> 
>
> 
>
> dfs.replication
>
> 1
>
> 
>
> 
>
> dfs.namenode.name.dir
>
> file:///res0/hdfs/name
>
> 
>
> 
>
> dfs.datanode.data.dir
>
> file:///res0/hdfs/data
>
> 
>
> 
>
>
>
> */res0/hdfs$ ls -la*
>
>
>
> drwxr-xr-x 4 root root 4096 Apr 19 05:19 .
>
> drwxr-xr-x 4 root root 4096 Apr 19 05:18 ..
>
> drwxr-xr-x 2 root root 4096 Apr 19 05:18 data
>
> drwxr-xr-x 2 root root 4096 Apr 20 03:27 name
>
>
>
> Thanks in advance,
>
>
>
> Gabe
>


Hdfs read and write operation

2017-04-19 Thread Sidharth Kumar
Hi,

Please help me to understand:
1) If we read the anatomy of an HDFS file write in Hadoop: The Definitive Guide, it says
the data queue is consumed by the DataStreamer. So, can you tell me whether there is only
one streamer which consumes packets from the data queue and creates a pipeline for each
packet to store it on the datanodes, or whether there are multiple streamers which consume
packets from the data queue and store them on the datanodes in parallel?
2) Multiple blogs have been written claiming that read and write are parallel processes
(below I have pasted one such link). Can you also help me by clarifying whether they are wrong?
http://stackoverflow.com/questions/30400249/hadoop-pipeline-write-and-parallel-read
pipeline-write-and-parallel-read

Bests
Sidharth
LinkedIn: www.linkedin.com/in/sidharthkumar2792


Re: Disk full errors in local-dirs, what data is stored in yarn.nodemanager.local-dirs?

2017-04-12 Thread Sidharth Kumar
Hi,

Can you paste the output of the "df -h" command here?

Regards
Sidharth
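
Besides df -h, it can also help to see which applications are holding space under
yarn.nodemanager.local-dirs; per-application working data normally sits under
usercache/<user>/appcache/<application id>. A sketch, using the local-dir path from the
quoted mail:

$ du -sh /l/ssd/achutest/localstore/yarn-nm/usercache/*/appcache/* 2>/dev/null | sort -h | tail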

On Wednesday, April 12, 2017, Albert Chu <ch...@llnl.gov> wrote:

> Hi,
>
> I have a cluster where we have a parallel networked file system for our
> major data storage and our nodes have ~750G of local SSD space.  To
> speed up things, we configure yarn.nodemanager.local-dirs to use the
> local SSD for local caching.
>
> Recently, I've been trying to do a terasort of 2 terabytes of data over
> 8 nodes w/ Hadoop 2.7.3.  So that's about 6000 gigs of local SSD space
> for caching, or 5400 gigs when hadoop uses its 90% disk full checking
> limit.
>
> I always get diskfull errors such as the below when running:
>
> 2017-04-11 12:31:44,062 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection:
> Directory /l/ssd/achutest/localstore/yarn-nm error, used space above
> threshold of 90.0%, removing from list of valid directories
> 2017-04-11 12:31:44,063 INFO 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService:
> Disk(s) failed: 1/1 local-dirs are bad: /l/ssd/achutest/localstore/
> yarn-nm;
> 2017-04-11 12:31:44,063 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService:
> Most of the disks failed. 1/1 local-dirs are bad:
> /l/ssd/achutest/localstore/yarn-nm;
>
> What I don't understand is how I am getting diskfull errors.  Within
> terasort, I should have at most 2000 gigs of mapped intermediate data
> and at most 2000 gigs of merged data in reducers.  Even assuming some
> overhead from Hadoop, I should have more than enough space for this
> benchmark to complete given maps and reducers are spread out evenly
> across nodes.
>
> So my assumption is something else is being cached in local-dirs that
> I'm not accounting for.  Is there any other data I should consider when
> coming up with my estimates?
>
> One guess I had.  Is it possible spilled data from reducer merges are
> not deleted until a reducer completes?  Given my example above, the
> total amount of merged data in reducers may exceed 2000 gigs at some
> point?
>
> Al
>
> --
> Albert Chu
> ch...@llnl.gov <javascript:;>
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org <javascript:;>
> For additional commands, e-mail: user-h...@hadoop.apache.org
> <javascript:;>
>
>

-- 
Regards
Sidharth Kumar | Mob: +91 8197 555 599 | LinkedIn
<https://www.linkedin.com/in/sidharthkumar2792/>


Re: Anatomy of read in hdfs

2017-04-10 Thread Sidharth Kumar
Thanks Philippe, but your answers raised another set of questions for me. Please help me
to understand:
1) If we read the anatomy of an HDFS file write in Hadoop: The Definitive Guide, it says
the data queue is consumed by the DataStreamer. So, can you tell me whether there is only
one streamer which consumes packets from the data queue and creates a pipeline for each
packet to store it on the datanodes, or whether there are multiple streamers which consume
packets from the data queue and store them on the datanodes in parallel?
2) Multiple blogs have been written claiming that read and write are parallel processes
(below I have pasted one such link). Can you also help me by clarifying whether they are wrong?
http://stackoverflow.com/questions/30400249/hadoop-pipeline-write-and-parallel-read


Thanks for your help in advance

Sidharth
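
One way to see the block placement that the parallel-read discussion below relies on, for a
file already in HDFS (the path is only an example):

$ hdfs fsck /user/sidharth/data.txt -files -blocks -locations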

On 10-Apr-2017 3:31 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:

>
>
> On Mon, Apr 10, 2017 at 11:46 AM, Sidharth Kumar <
> sidharthkumar2...@gmail.com> wrote:
>
>> Thanks Philippe,
>>
>> I am looking for an answer restricted to HDFS only, because we can do read
>> and write operations from the CLI using commands like "hadoop fs
>> -copyFromLocal /(local disk location) /(hdfs path)" and read using "hadoop
>> fs -text /(hdfs file)" as well.
>>
>> So my questions are:
>> 1) When I write data using the -copyFromLocal command, how is data from the
>> data queue pushed to the DataStreamer? Is there only one DataStreamer, which
>> listens to the data queue and stores data on individual datanodes one by
>> one, or are there multiple streamers that listen to the data queue and
>> create a pipeline for each individual packet?
>>
> One stream per command. You may start several commands, one per file, but
> the bottleneck will quickly be the network.
> This command is only used to import/export data to/from the Hadoop cluster.
> The main reads and writes should occur inside the cluster, when you do the
> processing.
>
>
>> 2) Similarly, when we read data, will the client receive packets one after
>> another in a sequential manner (i.e., the 2nd datanode waits for the 1st to
>> send its block first), or will it be a parallel process?
>>
> It depends on the reader. If you use the command-line client, yes, the reads
> will be sequential. If you use Hadoop YARN processing frameworks (MapReduce,
> Spark, Tez, etc.) then multiple readers (map tasks) will be started to
> process your data in parallel.
>
> What do you want to do with the data that you read?
>
> Regards,
> Philippe​
>
>
>
>>
>>
>> Thanks for your help in advance.
>>
>> Sidharth
>>
>>
>> On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:
>>
>>> Hi Sidharth,
>>>
>>> As it has been explained, HDFS is not just a file system. It's a part of
>>> the Hadoop platform. To take advantage of HDFS you have to understand how
>>> Hadoop storage (HDFS) AND YARN processing (say, MapReduce) work together
>>> to implement jobs and parallel processing.
>>> That said, you will have to rethink the design of your programs to
>>> take advantage of HDFS.
>>>
>>> You may start with this kind of tutorial
>>> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>>>
>>> Then have a deeper read of the Hadoop documentation
>>> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client
>>> /hadoop-mapreduce-client-core/MapReduceTutorial.html
>>>
>>> Regards,
>>> Philippe
>>>
>>>
>>>
>>> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>>
>>>> Readers ARE parallel processes, one per map task. There are defaults in
>>>> the map phase, about how many readers there are for the input file(s). Default
>>>> is one mapper task per block (or file, where any file is smaller than the HDFS
>>>> block size). There is no Java framework per se for splitting up a file
>>>> (technically not so, but let's simplify, outside of your own custom code).
>>>>
>>>>
>>>> *...*
>>>>
>>>>
>>>>
>>>> Daemeon C.M. Reiydelle
>>>> USA (+1) 415.501.0198
>>>> London (+44) (0) 20 8144 9872
>>>>
>>>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
>>>> sidharthkumar2...@gmail.com> wrote:
>>>>
>>>>> Thanks Tariq, It really helped me to understand but just one another
>>>>> doubt that if reading 

Re: Anatomy of read in hdfs

2017-04-10 Thread Sidharth Kumar
Thanks Philippe,

I am looking for an answer restricted to HDFS only, because we can do read and
write operations from the CLI using commands like "hadoop fs -copyFromLocal
/(local disk location) /(hdfs path)" and read using "hadoop fs -text
/(hdfs file)" as well.

So my questions are:
1) When I write data using the -copyFromLocal command, how is data from the data
queue pushed to the DataStreamer? Is there only one DataStreamer, which listens
to the data queue and stores data on individual datanodes one by one, or are
there multiple streamers that listen to the data queue and create a pipeline
for each individual packet?

2) Similarly, when we read data, will the client receive packets one after
another in a sequential manner (i.e., the 2nd datanode waits for the 1st to
send its block first), or will it be a parallel process?


Thanks for your help in advance.

Sidharth


On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:

> Hi Sidharth,
>
> As it has been explained, HDFS is not just a file system. It's a part of
> the Hadoop platform. To take advantage of HDFS you have to understand how
> Hadoop storage (HDFS) AND YARN processing (say, MapReduce) work together
> to implement jobs and parallel processing.
> That said, you will have to rethink the design of your programs to
> take advantage of HDFS.
>
> You may start with this kind of tutorial
> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>
> Then have a deeper read of the Hadoop documentation
> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-
> client/hadoop-mapreduce-client-core/MapReduceTutorial.html
>
> Regards,
> Philippe
>
>
>
> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> Readers ARE parallel processes, one per map task. There are defaults in
>> the map phase, about how many readers there are for the input file(s). Default
>> is one mapper task per block (or file, where any file is smaller than the HDFS
>> block size). There is no Java framework per se for splitting up a file
>> (technically not so, but let's simplify, outside of your own custom code).
>>
>>
>> *...*
>>
>>
>>
>> Daemeon C.M. Reiydelle
>> USA (+1) 415.501.0198
>> London (+44) (0) 20 8144 9872
>>
>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
>> sidharthkumar2...@gmail.com> wrote:
>>
>>> Thanks Tariq, that really helped me understand, but I have just one more
>>> doubt: if reading is not a parallel process, then to read a 100GB file with
>>> an HDFS block size of 128MB it should take very long to read the complete
>>> file, but that is not the scenario in real life. My second question is
>>> whether the write operation is also a sequential process, and whether every
>>> datanode has its own DataStreamer that listens to the data queue to get the
>>> packets and create the pipeline. So, can you kindly help me get a clear idea
>>> of HDFS read and write operations?
>>>
>>> Regards
>>> Sidharth
>>>
>>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <donta...@gmail.com> wrote:
>>>
>>> Hi Sidharth,
>>>
>>> When you read data from HDFS using a framework like MapReduce, blocks
>>> of an HDFS file are read in parallel by multiple mappers created in that
>>> particular program. Input splits, to be precise.
>>>
>>> On the other hand, if you have a standalone Java program, it's just a
>>> single-threaded process and will read the data sequentially.
>>>
>>>
>>> On Friday, April 7, 2017, Sidharth Kumar <sidharthkumar2...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your response, but I didn't understand it yet. If you don't
>>>> mind, can you tell me what you mean by "With Hadoop, the idea is to
>>>> parallelize the readers (one per block for the mapper) with processing
>>>> framework like MapReduce."?
>>>>
>>>> And also, how will the concept of parallelizing the readers work with HDFS?
>>>>
>>>> Thanks a lot in advance for your help.
>>>>
>>>>
>>>> Regards
>>>> Sidharth
>>>>
>>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:
>>>>
>>>> Hi Sidharth,
>>>>
>>>> The reads are sequential.
>>>> With Hadoop, the idea is to parallelize the readers (one per block for
>>>> the mapper) with processing framework like MapReduce.
>>>>
>&

Re: Anatomy of read in hdfs

2017-04-09 Thread Sidharth Kumar
Thanks Tariq, that really helped me understand, but I have just one more doubt:
if reading is not a parallel process, then to read a 100GB file with an HDFS
block size of 128MB it should take very long to read the complete file, but
that is not the scenario in real life. My second question is whether the write
operation is also a sequential process, and whether every datanode has its own
DataStreamer that listens to the data queue to get the packets and create the
pipeline. So, can you kindly help me get a clear idea of HDFS read and write
operations?

Regards
Sidharth

On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <donta...@gmail.com> wrote:

Hi Sidharth,

When you read data from HDFS using a framework like MapReduce, blocks of an
HDFS file are read in parallel by multiple mappers created in that
particular program. Input splits, to be precise.

On the other hand, if you have a standalone Java program, it's just a
single-threaded process and will read the data sequentially.
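
Just to make that concrete, a minimal sketch of such a standalone sequential
read (the path is only an example): there is one input stream for the whole
file, and the switch from one block's datanode to the next happens inside that
stream.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/read-sketch.txt");     // example path only
        try (FSDataInputStream in = fs.open(file);        // one stream for the whole file
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {  // sequential read() calls; moving to the
                System.out.println(line);                 // next block's datanode is transparent
            }
        }
    }
}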


On Friday, April 7, 2017, Sidharth Kumar <sidharthkumar2...@gmail.com>
wrote:

> Thanks for your response, but I didn't understand it yet. If you don't mind,
> can you tell me what you mean by "With Hadoop, the idea is to parallelize
> the readers (one per block for the mapper) with processing framework like
> MapReduce."?
>
> And also, how will the concept of parallelizing the readers work with HDFS?
>
> Thanks a lot in advance for your help.
>
>
> Regards
> Sidharth
>
> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:
>
> Hi Sidharth,
>
> The reads are sequential.
> With Hadoop, the idea is to parallelize the readers (one per block for the
> mapper) with processing framework like MapReduce.
>
> Regards,
> Philippe
>
>
> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
> sidharthkumar2...@gmail.com> wrote:
>
>> Hi Genies,
>>
>> I have a small doubt about whether the HDFS read operation is a parallel or
>> a sequential process. From my understanding it should be parallel, but the
>> "Hadoop Definitive Guide, 4th edition", in its anatomy of a read, says: "Data
>> is streamed from the datanode back to the client, which calls read()
>> repeatedly on the stream (step 4). When the end of the block is reached,
>> DFSInputStream will close the connection to the datanode, then find the
>> best datanode for the next block (step 5). This happens transparently to
>> the client, which from its point of view is just reading a continuous
>> stream."
>>
>> So can you kindly explain how the read operation actually happens?
>>
>>
>> Thanks for your help in advance
>>
>> Sidharth
>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
Technical Director (Switzerland),
pkerne...@octo.com
+41 79 888 33 32

Find OCTO on OCTO Talk: http://blog.octo.com
> OCTO Technology http://www.octo.ch
>
>
>

-- 


Tariq, Mohammad
about.me/mti <http://about.me/mti>


Re: Anatomy of read in hdfs

2017-04-07 Thread Sidharth Kumar
Thanks for your response, but I didn't understand it yet. If you don't mind,
can you tell me what you mean by "With Hadoop, the idea is to parallelize
the readers (one per block for the mapper) with processing framework like
MapReduce."?

And also, how will the concept of parallelizing the readers work with HDFS?

Thanks a lot in advance for your help.


Regards
Sidharth

On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:

Hi Sidharth,

The reads are sequential.
With Hadoop, the idea is to parallelize the readers (one per block for the
mapper) with processing framework like MapReduce.
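
A minimal illustration of that idea (not taken from any earlier mail, just a
word-count-style mapper): the framework starts one map task per input split,
which is roughly one per HDFS block, so the splits of a large file are read in
parallel even though each individual task reads its own split sequentially.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One instance of this mapper runs per input split, so the splits of a large
// file are processed by many tasks at the same time across the cluster.
public class LineCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text("lines");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each call hands this task one record from its own split; which block the
        // record comes from is decided by the framework, not by this code.
        context.write(outKey, ONE);
    }
}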

Regards,
Philippe


On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <sidharthkumar2...@gmail.com>
wrote:

> Hi Genies,
>
> I have a small doubt about whether the HDFS read operation is a parallel or
> a sequential process. From my understanding it should be parallel, but the
> "Hadoop Definitive Guide, 4th edition", in its anatomy of a read, says: "Data
> is streamed from the datanode back to the client, which calls read()
> repeatedly on the stream (step 4). When the end of the block is reached,
> DFSInputStream will close the connection to the datanode, then find the
> best datanode for the next block (step 5). This happens transparently to
> the client, which from its point of view is just reading a continuous
> stream."
>
> So can you kindly explain how the read operation actually happens?
>
>
> Thanks for your help in advance
>
> Sidharth
>
>


-- 
Philippe Kernévez



Technical Director (Switzerland),
pkerne...@octo.com
+41 79 888 33 32

Find OCTO on OCTO Talk: http://blog.octo.com
OCTO Technology http://www.octo.ch


Anatomy of read in hdfs

2017-04-06 Thread Sidharth Kumar
Hi Genies,

I have a small doubt about whether the HDFS read operation is a parallel or a
sequential process. From my understanding it should be parallel, but the
"Hadoop Definitive Guide, 4th edition", in its anatomy of a read, says: "Data is
streamed from the datanode back to the client, which calls read() repeatedly on
the stream (step 4). When the end of the block is reached, DFSInputStream
will close the connection to the datanode, then find the best datanode
for the next block (step 5). This happens transparently to the client, which
from its point of view is just reading a continuous stream."

So can you kindly explain how the read operation actually happens?


Thanks for your help in advance

Sidharth


Customize Sqoop default property

2017-04-06 Thread Sidharth Kumar
Hi,

I am importing data from an RDBMS into Hadoop using Sqoop, but my RDBMS data
is multi-valued and contains the "," special character. While importing the
data into Hadoop, Sqoop by default separates the columns with the ","
character. Is there any property through which we can change this delimiter
from "," to "|", or to any other special character that is not part of the
data?
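
For what it's worth, Sqoop lets you pick the field delimiter at import time
with --fields-terminated-by, so something along these lines should do it (the
connection details, table name and target directory below are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username myuser -P \
  --table customers \
  --fields-terminated-by '|' \
  --target-dir /user/sidharth/customers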

Thanks
Sidharth


Request for Hadoop mailing list subscription and 3.0.0 issues

2017-03-28 Thread Sidharth Kumar
Hi Folks,

I have been working as a full-time Hadoop administrator for 2 years and want to
be part of the Apache Hadoop community so that I can contribute the best of my
knowledge and learn more about it from experts. So, could you please help me
subscribe to the other mailing lists?

I also want to ask about an issue I faced while setting up an Apache Hadoop
3.0.0-alpha2 cluster. I installed and configured the cluster and HDFS was
working fine, but MapReduce jobs were failing. I made multiple attempts, adding
a few more configuration properties, but it looked like the jobs were not
picking up the configuration, while the same set of configuration worked fine
for Hadoop 2.7.2 and other stable releases.
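
In case it helps: one guess, based only on the symptoms above, is that the
MapReduce framework classpath is not being picked up, which in the 3.0.0-alpha
line has to be set explicitly. A mapred-site.xml sketch worth trying (replace
${HADOOP_HOME} with the actual install path if that variable is not visible to
YARN):

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>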


Thanks for your help in advance

-- 
Regards
Sidharth Kumar | Mob: +91 8197 555 599 | LinkedIn
<https://www.linkedin.com/in/sidharthkumar2792/>