hadoop cluster with mixed servers (different memory, speed, etc.)

2015-09-17 Thread Demai Ni
hi, folks,

I am wondering how a Hadoop cluster handles commodity hardware with different
speed and capacity.

This situation is already happening and will probably become very common soon: a
cluster starts with 100 machines, and in a couple of years another 100
machines are added. With Moore's law as an indicator, the new and old machines are
at least one generation apart. The situation gets even more complex if the
'new' 100 join the cluster gradually. How does Hadoop handle this situation and
avoid the weakest-link problem?

thanks

Demai


Re: HDFS ShortCircuit Read on Mac?

2015-09-08 Thread Demai Ni
Chris, many thanks for the quick response. I will disable short-circuit reads
on my Mac for now. :-)  Demai

On Tue, Sep 8, 2015 at 4:57 PM, Chris Nauroth <cnaur...@hortonworks.com>
wrote:

> Hello Demai,
>
> HDFS short-circuit read currently does not work on Mac, due to some
> platform differences in handling of domain sockets.  The last time I
> checked, our Hadoop code was exceeding a maximum path length enforced on
> Mac for domain socket paths.  I haven't had availability to look at this in
> a while, but the prior work is tracked in JIRA issues HDFS-3296 and
> HADOOP-11957 if you want to see the current progress.
>
> --Chris Nauroth
>
> From: Demai Ni <nid...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, September 8, 2015 at 4:46 PM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: HDFS ShortCircuit Read on Mac?
>
> hi, folks,
>
> wondering whether anyone has set up HDFS short-circuit read on a Mac? I installed
> hadoop through Homebrew on my Mac. It is up and running, but I cannot
> configure "dfs.domain.socket.path" as instructed here:
>
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
> since there is no dn_socket on the Mac.
>
> any pointers are appreciated.
>
> Demai
>
>


HDFS ShortCircuit Read on Mac?

2015-09-08 Thread Demai Ni
hi, folks,

wondering whether anyone has set up HDFS short-circuit read on a Mac? I installed
hadoop through Homebrew on my Mac. It is up and running, but I cannot
configure "dfs.domain.socket.path" as instructed here:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
since there is no dn_socket on the Mac.

any pointers are appreciated.

Demai
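
For reference, this is roughly what the documented short-circuit setup looks like
on Linux, sketched as client-side Configuration overrides in Java. The two property
names come from the Apache page linked above; the socket path is only an example,
and the DataNode must be configured with the same dfs.domain.socket.path in its own
hdfs-site.xml (which is exactly the part that runs into trouble on a Mac).

import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Client-side half of the documented short-circuit setup. The path below is
    // only the Linux example from the docs and will not work on a Mac.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");

    System.out.println("dfs.client.read.shortcircuit = "
        + conf.getBoolean("dfs.client.read.shortcircuit", false));
    System.out.println("dfs.domain.socket.path       = "
        + conf.get("dfs.domain.socket.path"));
  }
}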


hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Demai Ni
hi, folks,

I have a quick question about how HDFS handles caching. In this lab
experiment, I have a 4-node Hadoop cluster (2.x) and each node has fairly
large memory (96GB). I have a single 256MB HDFS file, which also fits
in one HDFS block. The local filesystem is Linux.

Now, from one of the DataNodes, I started 10 Hadoop client processes to
repeatedly read the above file, with the assumption that HDFS will cache
the 256MB in memory so that (after the 1st read) READs will involve no disk
I/O anymore.

My question is: *how many COPIES of the 256MB will be in the memory of this
DataNode? 10 or 1?*

What if the 10 client processes are located on a 5th Linux box,
independent of the cluster? Will we have 10 copies of the 256MB or just
1?

Many thanks. Appreciate your help on this.

Demai


Re: hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Demai Ni
Ritesh,

many thanks for your response. I just read through the Centralized Cache
Management document; thanks for the pointer. A couple of follow-up questions.

First, centralized cache requires 'explicit' configuration, so by
default there is no HDFS-managed cache? Will caching then occur at the local
filesystem level, e.g. in the Linux page cache?

The 2nd question: the centralized cache lives on the DataNodes of HDFS. Let's say
the client is a stand-alone Linux box (not part of the cluster), which connects
to the HDFS cluster with centralized cache configured, so on the HDFS cluster
the file is cached. In the same scenario, the client has 10 processes
repeatedly reading the same HDFS file. Will the HDFS client API be able to cache
the file content on the client side? Or will every READ have to move the whole
file through the network, with no sharing between processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh 
riteshoneinamill...@gmail.com wrote:

 Let's assume that HDFS maintains 3 replicas of the 256MB block; then each
 of these 3 datanodes will have only one copy of the block in its
 respective memory cache, thus avoiding repeated I/O reads. This goes
 with the centralized cache management feature of HDFS, which also gives you an
 option to pin only 2 of these 3 replicas in cache and save the remaining 256MB of
 cache space. Here's a link on the same:
 https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

 Hope that helps.

 Ritesh
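
To make the centralized cache option above concrete, here is a minimal Java sketch
against the DistributedFileSystem API. The pool name and file path are hypothetical,
the cluster needs centralized caching enabled (e.g. dfs.datanode.max.locked.memory on
the DataNodes), and creating a cache pool normally requires HDFS superuser rights.
Note that this cache lives on the DataNodes; the HDFS client itself does not cache
file data across processes, so a client outside the cluster will still pull the file
over the network on each read.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class CacheDirectiveSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // Create a pool and pin one cached replica of the file into DataNode memory.
    // Pool name and path are hypothetical placeholders.
    dfs.addCachePool(new CachePoolInfo("demo-pool"));
    long directiveId = dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
        .setPool("demo-pool")
        .setPath(new Path("/user/demai/bigfile"))
        .setReplication((short) 1)   // cache only one of the replicas
        .build());
    System.out.println("added cache directive " + directiveId);
  }
}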



Re: a non-commerial distribution of hadoop ecosystem?

2015-06-01 Thread Demai Ni
Chris and Roman,

many thanks for the quick response. I will take a look at Bigtop.
Actually, I had heard about it, but thought it was an installation framework
rather than a Hadoop distribution. Now I am looking at the Bigtop 0.7.0
Hadoop instructions, which will probably work fine for my needs. Appreciate
the pointer.

Roman, I will ping you off list about ODP. I was hoping ODP would be the one
for me. Well, in reality it is owned by a few companies, though at least not by
ONE company. :-)  That is fine with me, as long as ODP is open to be used by
others. I am just having trouble finding documentation/installation info for
ODP. Maybe I should google harder? :-)

Demai


On Mon, Jun 1, 2015 at 1:46 PM, Roman Shaposhnik r...@apache.org wrote:

 On Mon, Jun 1, 2015 at 1:37 PM, Demai Ni nid...@gmail.com wrote:
  My question is: besides the commercial distributions CDH (Cloudera), HDP
  (Hortonworks), and others like MapR, IBM..., is there a distribution that is
  NOT owned by a company? I am looking for something simple for cluster
  configuration/installation of multiple components: hdfs, yarn, zookeeper,
  hive, hbase, maybe Spark. Surely a well-experienced person (not me)
  can build a distribution from the Apache releases. Well, I am more
  interested in building applications on top of it, and hopefully in finding
  one that packs them together.

 Apache Bigtop (CCed) aims at delivering a 100% open and
 community-driven distribution of big data management technologies
 around Apache Hadoop. Same as, for example, what Debian is trying
 to do for Linux.

  BTW, I don't need the latest releases that the commercial distributions
  offer. I am also looking into ODP (the Open Data Platform), but that
  project has been kind of quiet since the initial February announcement.

 Feel free to ping me off list if you want more details on ODP.

 Thanks,
 Roman.



Re: Connect c language with HDFS

2015-05-04 Thread Demai Ni
I would also suggest taking a look at
https://issues.apache.org/jira/browse/HDFS-6994. I have been using libhdfs3
for a POC in the past few months, and highly recommend it. The only drawback is
that libhdfs3 has not been formally committed into hadoop/hdfs yet.

If you only want to play with HDFS, using the existing libhdfs library is fine,
but if you are looking at some serious development, libhdfs3 has a lot of
advantages.


On Mon, May 4, 2015 at 3:59 AM, unmesha sreeveni unmeshab...@gmail.com
wrote:

 Thanks
 Did it.

 http://unmeshasreeveni.blogspot.in/2015/05/hadoop-word-count-using-c-hadoop.html

 On Mon, May 4, 2015 at 3:43 PM, Alexander Alten-Lorenz 
 wget.n...@gmail.com wrote:

 That depends on the installation source (rpm, tgz or parcels). Usually,
 when you use parcels, libhdfs.so* should be within /opt/cloudera/parcels/
 CDH/lib64/ (or similar). Or just use linux' locate (locate
 libhdfs.so*) to find the library.




 --
 Alexander Alten-Lorenz
 m: wget.n...@gmail.com
 b: mapredit.blogspot.com

 On May 4, 2015, at 11:39 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 thanks Alex,
   I have gone through the same, but when I checked my Cloudera
 distribution I was not able to find those folders. That's why I posted here. I
 don't know if I made any mistake.

 On Mon, May 4, 2015 at 2:40 PM, Alexander Alten-Lorenz 
 wget.n...@gmail.com wrote:

 Google:

 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html

 --
 Alexander Alten-Lorenz
 m: wget.n...@gmail.com
 b: mapredit.blogspot.com

 On May 4, 2015, at 10:57 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Hi
    Can we connect C with HDFS using the Cloudera Hadoop distribution?

 --
  *Thanks & Regards*


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
  *Thanks & Regards*


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
  *Thanks & Regards*


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





Re: Data locality

2015-03-02 Thread Demai Ni
hi, folks,

I have a similar question. Is there an easy way to tell (from a user's
perspective) whether short-circuit reads are enabled? Thanks

Demai

On Mon, Mar 2, 2015 at 11:46 AM, Fei Hu hufe...@gmail.com wrote:

 Hi All,

 I developed a scheduler for data locality. Now I want to test the
 performance of the scheduler, so I need to monitor how much data is read
 remotely. Is there any tool for monitoring the volume of data moved around
 the cluster?

 Thanks,
 Fei
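
One client-side way to answer both questions, assuming a reasonably recent 2.x
release, is the per-stream read statistics kept by the HDFS input stream: they split
the total bytes read into local and short-circuit portions.
HdfsDataInputStream#getReadStatistics() is not guaranteed to be a stable public API
in every version, so treat this as an illustrative sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class ReadStatsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[64 * 1024];
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      while (in.read(buf) > 0) { }   // drain the file
      if (in instanceof HdfsDataInputStream) {
        HdfsDataInputStream hin = (HdfsDataInputStream) in;
        System.out.println("total bytes read         : " + hin.getReadStatistics().getTotalBytesRead());
        System.out.println("local bytes read         : " + hin.getReadStatistics().getTotalLocalBytesRead());
        System.out.println("short-circuit bytes read : " + hin.getReadStatistics().getTotalShortCircuitBytesRead());
      }
    }
  }
}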


[HDFS] result order of getFileBlockLocations() and listFiles()?

2014-10-29 Thread Demai Ni
hi, Guys,

I am trying to implement a simple program (experimental, not for production).
It invokes FileSystem.listFiles() to get a list of files under an HDFS
folder, and then uses FileSystem.getFileBlockLocations() to get the
replica locations of each file/block.

Since it is a controlled environment, I can make sure the files are static
and don't need to worry about DataNode crashes, fail-over, etc.

Assume that within a small time window (say, 1 minute), 100s~1000s of clients
invoke the same program to look up the same folder. Will the above two APIs
guarantee the *same result in the same order* for all clients?

To elaborate a bit more, say there is a folder called /dfs/dn/user/data that
contains three files: file1, file2, and file3.  If client1 gets:
listFiles() : file1,file2,file3
getFileBlockLocation(file1) - datanode1, datanode3, datanode6

Will all other clients get the same information (I think so) and in the same
order? Or does each client have to sort the results to guarantee the order?

Many thanks for your inputs

Demai
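
The API documentation does not promise a particular ordering of replica hosts (they
are typically arranged by proximity to the reader, which can differ per client), so
if every client must see exactly the same sequence, the safe approach is to sort on
the client side. A minimal sketch under that assumption, reusing the example folder
above:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class StableListingSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    RemoteIterator<LocatedFileStatus> it =
        fs.listFiles(new Path("/dfs/dn/user/data"), false);   // folder from the example
    List<LocatedFileStatus> files = new ArrayList<LocatedFileStatus>();
    while (it.hasNext()) {
      files.add(it.next());
    }

    // Impose a deterministic order on the client instead of trusting the server's order.
    Collections.sort(files, new Comparator<LocatedFileStatus>() {
      @Override
      public int compare(LocatedFileStatus a, LocatedFileStatus b) {
        return a.getPath().toString().compareTo(b.getPath().toString());
      }
    });

    for (LocatedFileStatus f : files) {
      for (BlockLocation b : f.getBlockLocations()) {
        String[] hosts = b.getHosts().clone();
        Arrays.sort(hosts);   // sort replica hosts as well
        System.out.println(f.getPath() + " @" + b.getOffset() + " -> " + Arrays.toString(hosts));
      }
    }
  }
}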


Re: read from a hdfs file on the same host as client

2014-10-13 Thread Demai Ni
Shivram,

many thanks for confirming the behavior. I will also turn on short-circuit
reads as you suggested. Appreciate the help.

Demai

On Mon, Oct 13, 2014 at 3:42 PM, Shivram Mani sm...@pivotal.io wrote:

 Demai, you are right. HDFS's default BlockPlacementPolicyDefault makes
 sure one replica of your block is available on the writer's datanode.
 The replica selection for the read operation is also aimed at minimizing
 bandwidth/latency and will serve the block from the reader's local node.
 If you want to further optimize this, you can set 
 'dfs.client.read.shortcircuit'
 to true. This would allow the client to bypass the datanode to read the
 file directly.

 On Mon, Oct 13, 2014 at 11:58 AM, Demai Ni nid...@gmail.com wrote:

 hi, folks,

 a very simple question, looking forward a couple pointers.

  Let's say I have an HDFS file, testfile, which has only one block (256MB),
  and the block has a replica on the datanode host1.hdfs.com (the whole HDFS
  cluster may have 100 nodes though, and the other 2 replicas are available on
  other datanodes).

  If, on host1.hdfs.com, I do a hadoop fs -cat testfile or use a Java client
  to read the file, should I assume there won't be any significant data
  movement through the network? That is, is the namenode smart enough to give me
  the data on host1.hdfs.com directly?

 thanks

 Demai




 --
 Thanks
 Shivram
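
A quick way to check the behavior Shivram describes: ask for the file's block
locations and see whether the local hostname is among the replica hosts. This is
only a sketch; the file path is hypothetical and the hostname comparison is
simplistic (it assumes the DataNode registers under the same name that InetAddress
reports on this box).

import java.net.InetAddress;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalReplicaCheckSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/user/demai/testfile");   // hypothetical path
    FileStatus st = fs.getFileStatus(p);
    String localHost = InetAddress.getLocalHost().getCanonicalHostName();

    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      boolean local = Arrays.asList(loc.getHosts()).contains(localHost);
      System.out.println("block at offset " + loc.getOffset()
          + (local ? " has a replica on this host" : " has no replica on this host"));
    }
  }
}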



hdfs: a C API call to getFileSize() through libhdfs or libhdfs3?

2014-10-02 Thread Demai Ni
hi, folks,

To get the size of an HDFS file, the Java API has
FileSystem#getFileStatus(PATH)#getLen();
now I am trying to use a C client to do the same thing.

For a file on the local file system, I can grab the info like this:
fseeko(file, 0, SEEK_END);
size = ftello(file);

But I can't find a SEEK_END or getFileSize() call in the existing
libhdfs or the new libhdfs3:
https://issues.apache.org/jira/browse/HDFS-6994

Can someone point me to the right direction? many thanks

Demai


Re: Planning to propose Hadoop initiative to company. Need some inputs please.

2014-10-01 Thread Demai Ni
hi,

glad to see another person moving from the mainframe world to the 'big data'
one. I was in the same boat a few years back, after working on mainframes for
10+ years.

Wilm got to the pointers already. I'd like to just chime in a bit from the
mainframe side.

The website-usage example is a very good one for big data compared to the
mainframe, as the mainframe is very expensive while providing reliability for
mission-critical workloads. One approach is to look at the applications currently
running on the mainframe, or the ones your team is considering implementing
on it. For a website-usage case, the cost to implement and run on
hadoop/hbase would be only about 1/10 of the mainframe cost, and the
mainframe probably would not be able to scale if the data grows to TBs.

Second, be careful: Hadoop is not for all your cases. I am pretty sure
that your IT department is handling some mission-critical workloads, like
payroll, employee info, customer payments, etc. Leave those workloads on the
mainframe, because 1) hbase/hadoop are not designed for such RDBMS workloads, and
2) moving from one database to another is way too much risk unless the top
boss forces you to do so... :-)

Demai


On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher wilm.schumac...@cawoom.com
 wrote:

 Hi,

 first: I think hbase is what you are looking for. If I understand
 correctly, you want to show the customer his or her data very fast and
 let them manipulate their data. So you need something like a data
 warehouse system. Thus, hbase is the method of choice for you (and I
 think for your kind of data, hbase is a better choice than cassandra or
 mongoDB). But of course you need a running hadoop system to run hbase,
 so it's not an either/or ;)

 (my answers are for hbase, as I think it's what you are looking for. If
 you are not interested, just ignore the following text. Sorry @all for
 writing about hbase on this list ;).)

 Am 01.10.2014 um 17:24 schrieb mani kandan:
  1) How much web usage data will a typical website like ours collect on a
  daily basis? (I know I can ask our IT department, but I would like to
  gather some background idea before talking to them.)
 well, if you have the option to ask your IT department you should do
 that, because everyone here would have to guess. You would have to
 explain in great detail what you need to do before we could guess. If you e.g.
 want to track what the user has clicked on, perhaps to serve
 personalized ads, then you have to save more data. So you should ask
 the people who have the data right away, without guessing.

  3) How many clusters/nodes would I need to ​run a web usage analytics
  system?
 in the book hbase in action there are some recommendations from
 case studies (part IV, deploying hbase). There are some thoughts on
 the number of nodes, and how to use them, depending on the size of your
 data.

  4) What are the ways for me to use our data? (One use case I'm thinking
  of is to analyze the error messages log for each page on quote process
  to redesign the UI. Is this possible?)
 sure. And this should be very easy. I would pump the error log into an
 hbase table. That way you could read the messages directly from
 the hbase shell (if there are few enough of them), or you could use hive to
 query your log in a more SQL-like way and make statistics very easily.

  5) How long would it take for me to set up and start such a system?
 for a novice doing it for the first time: for the stand-alone
 hbase system, perhaps 2 hours. For a completely distributed test cluster
 ... perhaps a day. For the real production system, with all security
 features ... a little longer ;).

  I'm sorry if some/all of these questions are unanswerable. I just want
  to discuss my thoughts, and get an idea of what things can I achieve by
  going the way of Hadoop.
 well, I think (but I could err) that you think of hadoop (or hbase) in a
 way that you can just swap the database backend from SQL to
 hbase/hadoop and everything will run right away. It will not be
 that easy. You would have to change the code of your web application in
 a very fundamental way. You have to rethink all the table designs etc.,
 so this could be more complicated than you think right now.

 However, hbase/hadoop has some advantages which are very interesting for
 you. First, it is distributed, which enables your company to grow
 almost without limit, or to collect more data about your customers so you
 can get more information (and sell more stuff). And MapReduce is a
 wonderful tool for producing really fancy statistics, which is very
 interesting for an insurance company. Your mathematical economists will
 REALLY love it ;).

 Hope this helped.

 best wishes

 Wilm





Re: conf.get(dfs.data.dir) return null when hdfs-site.xml doesn't set it explicitly

2014-09-09 Thread Demai Ni
Susheel actually brought up a good point.

once the client code connects to the cluster, is there a way to get the real
cluster configuration variables/values instead of relying on the .xml files
on the client side?

Demai

On Mon, Sep 8, 2014 at 10:12 PM, Susheel Kumar Gadalay skgada...@gmail.com
wrote:

 One doubt on building the Configuration object.

 I have a remote Hadoop client and a Hadoop cluster.
 When a client submits an MR job, the Configuration object is built
 from the Hadoop cluster nodes' xml files, basically the resource manager
 node's core-site.xml, mapred-site.xml and yarn-site.xml.
 Am I correct?

 TIA
 Susheel Kumar

 On 9/9/14, Bhooshan Mogal bhooshan.mo...@gmail.com wrote:
  Hi Demai,
 
  conf = new Configuration()
 
  will create a new Configuration object and only add the properties from
  core-default.xml and core-site.xml in the conf object.
 
  This is basically a new configuration object, not the same that the
 daemons
  in the hadoop cluster use.
 
 
 
  I think what you are trying to ask is if you can get the Configuration
  object that a daemon in your live cluster (e.g. datanode) is using. I am
  not sure if the datanode or any other daemon on a hadoop cluster exposes
  such an API.
 
  I would in fact be tempted to get this information from the configuration
  management daemon instead - in your case cloudera manager. But I am not
  sure if CM exposes that API either. You could probably find out on the
  Cloudera mailing list.
 
 
  HTH,
  Bhooshan
 
 
  On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni nid...@gmail.com wrote:
 
  hi, Bhooshan,
 
  thanks for your kind response.  I run the code on one of the data node
 of
  my cluster, with only one hadoop daemon running. I believe my java
 client
  code connect to the cluster correctly as I am able to retrieve
  fileStatus,
  and list files under a particular hdfs path, and similar things...
  However, you are right that the daemon process use the hdfs-site.xml
  under
  another folder for cloudera :
  /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.
 
  about  retrieving the info from a live cluster, I would like to get
 the
  information beyond the configuration files(that is beyond the .xml
  files).
  Since I am able to use :
  conf = new Configuration()
  to connect to hdfs and did other operations, shouldn't I be able to
  retrieve the configuration variables?
 
  Thanks
 
  Demai
 
 
  On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal 
 bhooshan.mo...@gmail.com
  wrote:
 
  Hi Demai,
 
  When you read a property from the conf object, it will only have a
 value
  if the conf object contains that property.
 
  In your case, you created the conf object as new Configuration() --
 adds
  core-default and core-site.xml.
 
  Then you added site.xmls (hdfs-site.xml and core-site.xml) from
 specific
  locations. If none of these files have defined dfs.data.dir, then you
  will
  get NULL. This is expected behavior.
 
  What do you mean by retrieving the info from a live cluster? Even for
  processes like datanode, namenode etc, the source of truth for these
  properties is hdfs-site.xml. It is loaded from a specific location when
  you
  start these services.
 
  Question: Where are you running the above code? Is it on a node which
  has
  other hadoop daemons as well?
 
  My guess is that the path you are referring to (/etc/hadoop/conf.
  cloudera.hdfs/core-site.xml) is not the right path where these config
  properties are defined. Since this is a CDH cluster, you would probably
  be
  best served by asking on the CDH mailing list as to where the right
 path
  to
  these files is.
 
 
  HTH,
  Bhooshan
 
 
  On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni nid...@gmail.com wrote:
 
  hi, experts,
 
  I am trying to get the local filesystem directory of data node. My
  cluster is using CDH5.x (hadoop 2.3) and the default configuration. So
  the
  datanode is under file:///dfs/dn. I didn't specify the value in
  hdfs-site.xml.
 
  My code is something like:
 
  conf = new Configuration()
 
  // test both with and without the following two lines
  conf.addResource (new
  Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
  conf.addResource (new
  Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));

  // I also tried get("dfs.datanode.data.dir"), which also returns NULL
  String dnDir = conf.get("dfs.data.dir");  // returns NULL
 
  It looks like the get only look at the configuration file instead of
  retrieving the info from the live cluster?
 
  Many thanks for your help in advance.
 
  Demai
 
 
 
 
  --
  Bhooshan
 
 
 
 
 
  --
  Bhooshan
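
On getting the configuration a live daemon is actually using, rather than whatever
.xml files happen to be on the client: each Hadoop daemon's web UI serves its loaded
configuration under the /conf path, for example http://namenode:50070/conf or
http://datanode:50075/conf with the default 2.x ports. A rough sketch that scans a
DataNode's live configuration for one property follows; the host name is hypothetical
and the ports assume the defaults.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class LiveConfSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical DataNode host; 50075 is the default DataNode HTTP port in Hadoop 2.x.
    URL url = new URL("http://datanode1.example.com:50075/conf");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        // Grep for the property values the daemon actually loaded.
        if (line.contains("dfs.datanode.data.dir") || line.contains("dfs.data.dir")) {
          System.out.println(line.trim());
        }
      }
    } finally {
      in.close();
    }
  }
}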
 



Re: conf.get(dfs.data.dir) return null when hdfs-site.xml doesn't set it explicitly

2014-09-08 Thread Demai Ni
hi, Bhooshan,

thanks for your kind response. I ran the code on one of the DataNodes of
my cluster, with only one Hadoop daemon running. I believe my Java client
code connects to the cluster correctly, as I am able to retrieve fileStatus,
list files under a particular HDFS path, and do similar things... However,
you are right that the daemon process uses the hdfs-site.xml under another
folder for Cloudera:
/var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.

About "retrieving the info from a live cluster": I would like to get the
information beyond the configuration files (that is, beyond the .xml files).
Since I am able to use
conf = new Configuration()
to connect to HDFS and do other operations, shouldn't I be able to
retrieve the configuration variables?

Thanks

Demai


On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal bhooshan.mo...@gmail.com
wrote:

 Hi Demai,

 When you read a property from the conf object, it will only have a value
 if the conf object contains that property.

 In your case, you created the conf object as new Configuration() -- adds
 core-default and core-site.xml.

 Then you added site.xmls (hdfs-site.xml and core-site.xml) from specific
 locations. If none of these files have defined dfs.data.dir, then you will
 get NULL. This is expected behavior.

 What do you mean by retrieving the info from a live cluster? Even for
 processes like datanode, namenode etc, the source of truth for these
 properties is hdfs-site.xml. It is loaded from a specific location when you
 start these services.

 Question: Where are you running the above code? Is it on a node which has
 other hadoop daemons as well?

 My guess is that the path you are referring to (/etc/hadoop/conf.
 cloudera.hdfs/core-site.xml) is not the right path where these config
 properties are defined. Since this is a CDH cluster, you would probably be
 best served by asking on the CDH mailing list as to where the right path to
 these files is.


 HTH,
 Bhooshan


 On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni nid...@gmail.com wrote:

 hi, experts,

 I am trying to get the local filesystem directory of data node. My
 cluster is using CDH5.x (hadoop 2.3) and the default configuration. So the
 datanode is under file:///dfs/dn. I didn't specify the value in
 hdfs-site.xml.

 My code is something like:

 conf = new Configuration()

 // test both with and without the following two lines
 conf.addResource (new
 Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
 conf.addResource (new
 Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));

 // I also tried get("dfs.datanode.data.dir"), which also returns NULL
 String dnDir = conf.get("dfs.data.dir");  // returns NULL

 It looks like the get only look at the configuration file instead of
 retrieving the info from the live cluster?

 Many thanks for your help in advance.

 Demai




 --
 Bhooshan



Re: conf.get(dfs.data.dir) return null when hdfs-site.xml doesn't set it explicitly

2014-09-08 Thread Demai Ni
Bhooshan,

Many thanks. I appreciate the help. I will also try out Cloudera mailing
list/community

Demai

On Mon, Sep 8, 2014 at 4:58 PM, Bhooshan Mogal bhooshan.mo...@gmail.com
wrote:

 Hi Demai,

 conf = new Configuration()

 will create a new Configuration object and only add the properties from
 core-default.xml and core-site.xml in the conf object.

 This is basically a new configuration object, not the same that the
 daemons in the hadoop cluster use.



 I think what you are trying to ask is if you can get the Configuration
 object that a daemon in your live cluster (e.g. datanode) is using. I am
 not sure if the datanode or any other daemon on a hadoop cluster exposes
 such an API.

 I would in fact be tempted to get this information from the configuration
 management daemon instead - in your case cloudera manager. But I am not
 sure if CM exposes that API either. You could probably find out on the
 Cloudera mailing list.


 HTH,
 Bhooshan


 On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni nid...@gmail.com wrote:

 hi, Bhooshan,

 thanks for your kind response.  I run the code on one of the data node of
 my cluster, with only one hadoop daemon running. I believe my java client
 code connect to the cluster correctly as I am able to retrieve fileStatus,
 and list files under a particular hdfs path, and similar things...
 However, you are right that the daemon process use the hdfs-site.xml under
 another folder for cloudera :
 /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.

 about  retrieving the info from a live cluster, I would like to get the
 information beyond the configuration files(that is beyond the .xml files).
 Since I am able to use :
 conf = new Configuration()
 to connect to hdfs and did other operations, shouldn't I be able to
 retrieve the configuration variables?

 Thanks

 Demai


 On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal bhooshan.mo...@gmail.com
 wrote:

 Hi Demai,

 When you read a property from the conf object, it will only have a value
 if the conf object contains that property.

 In your case, you created the conf object as new Configuration() -- adds
 core-default and core-site.xml.

 Then you added site.xmls (hdfs-site.xml and core-site.xml) from specific
 locations. If none of these files have defined dfs.data.dir, then you will
 get NULL. This is expected behavior.

 What do you mean by retrieving the info from a live cluster? Even for
 processes like datanode, namenode etc, the source of truth for these
 properties is hdfs-site.xml. It is loaded from a specific location when you
 start these services.

 Question: Where are you running the above code? Is it on a node which
 has other hadoop daemons as well?

 My guess is that the path you are referring to (/etc/hadoop/conf.
 cloudera.hdfs/core-site.xml) is not the right path where these config
 properties are defined. Since this is a CDH cluster, you would probably be
 best served by asking on the CDH mailing list as to where the right path to
 these files is.


 HTH,
 Bhooshan


 On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni nid...@gmail.com wrote:

 hi, experts,

 I am trying to get the local filesystem directory of data node. My
 cluster is using CDH5.x (hadoop 2.3) and the default configuration. So the
 datanode is under file:///dfs/dn. I didn't specify the value in
 hdfs-site.xml.

 My code is something like:

 conf = new Configuration()

 // test both with and without the following two lines
 conf.addResource (new
 Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
 conf.addResource (new
 Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));

 // I also tried get("dfs.datanode.data.dir"), which also returns NULL
 String dnDir = conf.get("dfs.data.dir");  // returns NULL

 It looks like the get only look at the configuration file instead of
 retrieving the info from the live cluster?

 Many thanks for your help in advance.

 Demai




 --
 Bhooshan





 --
 Bhooshan



Re: question about matching java API with libHDFS

2014-09-04 Thread Demai Ni
hi, Yi A,

Thanks for your response. I took a look at hdfs.h and hdfs.c; it seems the
lib only exposes some of the APIs, as there are a lot of other public methods
that can be accessed through the Java API/client but are not implemented in
libhdfs, such as the one I am using now:
DFSClient.getNamenode().getBlockLocations(...).

Is libhdfs designed to limit the access? Thanks

Demai


On Thu, Sep 4, 2014 at 2:36 AM, Liu, Yi A yi.a@intel.com wrote:

  You could refer to the header file “src/main/native/libhdfs/hdfs.h” to
  get the APIs in detail.



 Regards,

 Yi Liu



 *From:* Demai Ni [mailto:nid...@gmail.com]
 *Sent:* Thursday, September 04, 2014 5:21 AM
 *To:* user@hadoop.apache.org
 *Subject:* question about matching java API with libHDFS



 hi, folks,

 I am currently using java to access HDFS. for example, I am using this API
  DFSclient.getNamenode().getBlockLocations(...)... to retrieve file
 block information.

 Now I need to move the same logic into C/C++. so I am looking at libHDFS,
 and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. And I am also
 using the hdfs_test.c for some reference. However, I couldn't find a way to
 easily figure out whether above Java API is exposed through libHDFS?

 Probably not, since I couldn't find it. Then, it lead to my next question.
 Is there an easy way to plug in the libHDFS framework, to include additonal
 API?

 thanks a lot for your suggestions

 Demai



question about matching java API with libHDFS

2014-09-03 Thread Demai Ni
hi, folks,

I am currently using Java to access HDFS. For example, I am using the API
DFSClient.getNamenode().getBlockLocations(...) to retrieve file block
information.

Now I need to move the same logic into C/C++, so I am looking at libHDFS
and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also
using hdfs_test.c for some reference. However, I couldn't find a way to
easily figure out whether the above Java API is exposed through libHDFS.

Probably not, since I couldn't find it. That leads to my next question:
is there an easy way to plug into the libHDFS framework to include additional
APIs?

thanks a lot for your suggestions

Demai
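
For the block-location piece in particular, the public FileSystem#getFileBlockLocations()
call returns much of what the private DFSClient.getNamenode().getBlockLocations() path
provides (hosts, offsets, lengths), and libhdfs exposes a roughly comparable call,
hdfsGetHosts() in hdfs.h, so it is worth checking there before extending the library.
A minimal Java sketch of the public API for comparison:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path(args[0]));
    // Public-API block-location lookup: one entry per block of the file.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset=" + loc.getOffset() + " len=" + loc.getLength()
          + " hosts=" + Arrays.toString(loc.getHosts()));
    }
  }
}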


Re: Local file system to access hdfs blocks

2014-08-29 Thread Demai Ni
Stanley, 

Thanks. 

Btw, I found the JIRA HDFS-2246, which probably matches what I am looking for.

Demai on the run

On Aug 28, 2014, at 11:34 PM, Stanley Shi s...@pivotal.io wrote:

 BP-13-7914115-10.122.195.197-14909166276345 is the block pool information;
 blk_1073742025 is the block name.
 
 These names are private to the HDFS system and users should not use them,
 right? But if you really want to know this, you can check the fsck code to see
 whether they are available.
 
 
 On Fri, Aug 29, 2014 at 8:13 AM, Demai Ni nid...@gmail.com wrote:
 Stanley and all,
 
 thanks. I will write a client application to explore this path. A quick 
 question again. 
 Using the fsck command, I can retrieve all the necessary info
 $ hadoop fsck /tmp/list2.txt -files -blocks -racks
 .
  BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025 len=8 repl=2
 [/default/10.122.195.198:50010, /default/10.122.195.196:50010]
 
 However, using getFileBlockLocations(), I can't get the block name/id info, 
 such as  BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025
 seem the BlockLocation don't provide the public info here. 
 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/BlockLocation.html
 
 is there another entry point? somethinig fsck is using? thanks 
 
 Demai
 
 
 
 
 On Wed, Aug 27, 2014 at 11:09 PM, Stanley Shi s...@pivotal.io wrote:
 As far as I know, there's no combination of hadoop API can do that.
 You can easily get the location of the block (on which DN), but there's no 
 way to get the local address of that block file.
 
 
 
 On Thu, Aug 28, 2014 at 11:54 AM, Demai Ni nid...@gmail.com wrote:
 Yehia,
 
 No problem at all. I really appreciate your willingness to help. Yeah. now 
 I am able to get such information through two steps, and the first step 
 will be either hadoop fsck or getFileBlockLocations(). and then search the 
 local filesystem, my cluster is using the default from CDH, which is 
 /dfs/dn
 
 I would like to it programmatically, so wondering whether someone already 
 done it? or maybe better a hadoop API call already implemented for this 
 exact purpose
 
 Demai
 
 
 On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater y.z.elsha...@gmail.com 
 wrote:
 Hi Demai,
 
 Sorry, I missed that you are already tried this out. I think you can 
 construct the block location on the local file system if you have the 
 block pool id and the block id. If you are using cloudera distribution, 
 the default location is under /dfs/dn ( the value of dfs.data.dir, 
 dfs.datanode.data.dir configuration keys).
 
 Thanks
 Yehia 
 
 
 On 27 August 2014 21:20, Yehia Elshater y.z.elsha...@gmail.com wrote:
 Hi Demai,
 
 You can use fsck utility like the following:
 
 hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks
 
 This will display all the information you need about the blocks of your 
 file.
 
 Hope it helps.
 Yehia
 
 
 On 27 August 2014 20:18, Demai Ni nid...@gmail.com wrote:
 Hi, Stanley,
 
 Many thanks. Your method works. For now, I can have two steps approach:
 1) getFileBlockLocations to grab hdfs BlockLocation[]
 2) use local file system call(like find command) to match the block to 
 files on local file system .
 
 Maybe there is an existing Hadoop API to return such info in already?
 
 Demai on the run
 
 On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote:
 
 I am not sure this is what you want but you can try this shell command:
 
 find [DATANODE_DIR] -name [blockname]
 
 
 On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote:
 Hi, folks,
 
 New in this area. Hopefully to get a couple pointers.
 
 I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)
 
 I am wondering whether there is a interface to get each hdfs block 
 information in the term of local file system.
 
 For example, I can use Hadoop fsck /tmp/test.txt -files -blocks 
 -racks to get blockID and its replica on the nodes, such as: repl 
 =3[ /rack/hdfs01, /rack/hdfs02...]
 
  With such info, is there a way to
 1) login to hfds01, and read the block directly at local file system 
 level?
 
 
 Thanks
 
 Demai on the run
 
 
 
 -- 
 Regards,
 Stanley Shi,
 
 
 
 
 -- 
 Regards,
 Stanley Shi,
 
 
 
 
 -- 
 Regards,
 Stanley Shi,
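
On the missing block name/id: when the underlying FileSystem is a DistributedFileSystem,
the BlockLocation objects it returns are HdfsBlockLocation instances wrapping the
internal LocatedBlock, so the block pool id and blk_... name can be pulled out with a
cast. This leans on HDFS-internal classes (the territory HDFS-2246 is about), so treat
the sketch below as non-portable rather than a stable API.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.HdfsBlockLocation;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;

public class BlockNameSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/list2.txt");   // file from the fsck example above
    FileStatus st = fs.getFileStatus(p);
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      if (loc instanceof HdfsBlockLocation) {   // true when fs is a DistributedFileSystem
        ExtendedBlock blk = ((HdfsBlockLocation) loc).getLocatedBlock().getBlock();
        System.out.println(blk.getBlockPoolId() + ":" + blk.getBlockName()
            + " on " + Arrays.toString(loc.getHosts()));
      }
    }
  }
}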
 


Re: Local file system to access hdfs blocks

2014-08-28 Thread Demai Ni
Stanley and all,

thanks. I will write a client application to explore this path. A quick
question again.
Using the fsck command, I can retrieve all the necessary info:
$ hadoop fsck /tmp/list2.txt -files -blocks -racks
.
 *BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025* len=8 repl=2
[/default/10.122.195.198:50010, /default/10.122.195.196:50010]

However, using getFileBlockLocations(), I can't get the block name/id info,
such as *BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025*; it seems
BlockLocation doesn't expose that as public info:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/BlockLocation.html

Is there another entry point, something fsck is using? Thanks

Demai




On Wed, Aug 27, 2014 at 11:09 PM, Stanley Shi s...@pivotal.io wrote:

 As far as I know, there's no combination of hadoop API can do that.
 You can easily get the location of the block (on which DN), but there's no
 way to get the local address of that block file.



 On Thu, Aug 28, 2014 at 11:54 AM, Demai Ni nid...@gmail.com wrote:

 Yehia,

 No problem at all. I really appreciate your willingness to help. Yeah.
 now I am able to get such information through two steps, and the first step
 will be either hadoop fsck or getFileBlockLocations(). and then search
 the local filesystem, my cluster is using the default from CDH, which is
 /dfs/dn

 I would like to it programmatically, so wondering whether someone already
 done it? or maybe better a hadoop API call already implemented for this
 exact purpose

 Demai


 On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater y.z.elsha...@gmail.com
 wrote:

 Hi Demai,

 Sorry, I missed that you are already tried this out. I think you can
 construct the block location on the local file system if you have the block
 pool id and the block id. If you are using cloudera distribution, the
 default location is under /dfs/dn ( the value of dfs.data.dir,
 dfs.datanode.data.dir configuration keys).

 Thanks
 Yehia


 On 27 August 2014 21:20, Yehia Elshater y.z.elsha...@gmail.com wrote:

 Hi Demai,

 You can use fsck utility like the following:

 hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks

 This will display all the information you need about the blocks of your
 file.

 Hope it helps.
 Yehia


 On 27 August 2014 20:18, Demai Ni nid...@gmail.com wrote:

 Hi, Stanley,

 Many thanks. Your method works. For now, I can have two steps approach:
 1) getFileBlockLocations to grab hdfs BlockLocation[]
 2) use local file system call(like find command) to match the block to
 files on local file system .

 Maybe there is an existing Hadoop API to return such info in already?

 Demai on the run

 On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote:

 I am not sure this is what you want but you can try this shell command:

 find [DATANODE_DIR] -name [blockname]


 On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote:

 Hi, folks,

 New in this area. Hopefully to get a couple pointers.

 I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)

 I am wondering whether there is a interface to get each hdfs block
 information in the term of local file system.

 For example, I can use Hadoop fsck /tmp/test.txt -files -blocks
 -racks to get blockID and its replica on the nodes, such as: repl =3[
 /rack/hdfs01, /rack/hdfs02...]

  With such info, is there a way to
 1) login to hfds01, and read the block directly at local file system
 level?


 Thanks

 Demai on the run




 --
 Regards,
 *Stanley Shi,*







 --
 Regards,
 *Stanley Shi,*




Re: Local file system to access hdfs blocks

2014-08-27 Thread Demai Ni
Hi, Stanley,

Many thanks. Your method works. For now, I have a two-step approach:
1) getFileBlockLocations() to grab the HDFS BlockLocation[]
2) use a local file system call (like the find command) to match the block to files
on the local file system.

Maybe there is an existing Hadoop API that returns such info already?

Demai on the run

On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote:

 I am not sure this is what you want but you can try this shell command:
 
 find [DATANODE_DIR] -name [blockname]
 
 
 On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote:
 Hi, folks,
 
 New in this area. Hopefully to get a couple pointers.
 
 I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)
 
 I am wondering whether there is a interface to get each hdfs block 
 information in the term of local file system.
 
 For example, I can use Hadoop fsck /tmp/test.txt -files -blocks -racks to 
 get blockID and its replica on the nodes, such as: repl =3[ /rack/hdfs01, 
 /rack/hdfs02...]
 
  With such info, is there a way to
 1) login to hfds01, and read the block directly at local file system level?
 
 
 Thanks
 
 Demai on the run
 
 
 
 -- 
 Regards,
 Stanley Shi,
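
The second step, matching a block name to a file on the DataNode's local disk, can
also be done programmatically; the sketch below simply mirrors the find command above
in Java. The data directory /dfs/dn is the CDH default mentioned in the thread and the
block name is an example from fsck output, so adjust both.

import java.io.File;

public class FindBlockFileSketch {
  // Recursive equivalent of: find [DATANODE_DIR] -name [blockname]
  static File findBlock(File dir, String blockName) {
    File[] children = dir.listFiles();
    if (children == null) {
      return null;   // unreadable, or not a directory
    }
    for (File c : children) {
      if (c.isDirectory()) {
        File hit = findBlock(c, blockName);
        if (hit != null) {
          return hit;
        }
      } else if (c.getName().equals(blockName)) {
        return c;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // /dfs/dn is the CDH default data directory; the block name comes from
    // fsck output or the block location, e.g. blk_1073742025.
    File hit = findBlock(new File("/dfs/dn"), "blk_1073742025");
    System.out.println(hit == null ? "not found on this node" : hit.getAbsolutePath());
  }
}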
 


Re: Local file system to access hdfs blocks

2014-08-27 Thread Demai Ni
Yehia,

No problem at all. I really appreciate your willingness to help. Yeah, now
I am able to get such information in two steps: the first step
is either hadoop fsck or getFileBlockLocations(), and then I search the
local filesystem; my cluster uses the CDH default, which is /dfs/dn.

I would like to do it programmatically, so I am wondering whether someone has
already done it, or better yet, whether a Hadoop API call is already implemented
for this exact purpose.

Demai


On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater y.z.elsha...@gmail.com
wrote:

 Hi Demai,

 Sorry, I missed that you had already tried this out. I think you can
 construct the block location on the local file system if you have the block
 pool id and the block id. If you are using the Cloudera distribution, the
 default location is under /dfs/dn (the value of the dfs.data.dir /
 dfs.datanode.data.dir configuration keys).

 Thanks
 Yehia


 On 27 August 2014 21:20, Yehia Elshater y.z.elsha...@gmail.com wrote:

 Hi Demai,

 You can use fsck utility like the following:

 hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks

 This will display all the information you need about the blocks of your
 file.

 Hope it helps.
 Yehia


 On 27 August 2014 20:18, Demai Ni nid...@gmail.com wrote:

 Hi, Stanley,

 Many thanks. Your method works. For now, I can have two steps approach:
 1) getFileBlockLocations to grab hdfs BlockLocation[]
 2) use local file system call(like find command) to match the block to
 files on local file system .

 Maybe there is an existing Hadoop API to return such info in already?

 Demai on the run

 On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote:

 I am not sure this is what you want but you can try this shell command:

 find [DATANODE_DIR] -name [blockname]


 On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote:

 Hi, folks,

 New in this area. Hopefully to get a couple pointers.

 I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)

 I am wondering whether there is a interface to get each hdfs block
 information in the term of local file system.

 For example, I can use Hadoop fsck /tmp/test.txt -files -blocks
 -racks to get blockID and its replica on the nodes, such as: repl =3[
 /rack/hdfs01, /rack/hdfs02...]

  With such info, is there a way to
 1) login to hfds01, and read the block directly at local file system
 level?


 Thanks

 Demai on the run




 --
 Regards,
 *Stanley Shi,*






Local file system to access hdfs blocks

2014-08-25 Thread Demai Ni
Hi, folks,

I am new in this area; hoping to get a couple of pointers.

I am using CentOS and have Hadoop set up using CDH 5.1 (Hadoop 2.3).

I am wondering whether there is an interface to get each HDFS block's information
in terms of the local file system.

For example, I can use "hadoop fsck /tmp/test.txt -files -blocks -racks" to get
the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01,
/rack/hdfs02...]

With such info, is there a way to
1) log in to hdfs01, and read the block directly at the local file system level?


Thanks

Demai on the run