hadoop cluster with mixed servers (different memory, speed, etc.)
hi, folks, I am wondering how a Hadoop cluster handles commodity hardware of different speed and capacity. This situation is already happening and will probably become very common soon: a cluster starts with 100 machines, and a couple of years later another 100 machines are added. With Moore's law as an indicator, the new and old machines are at least one generation apart. The situation gets even more complex if the 'new' 100 join the cluster gradually. How does Hadoop handle this situation and avoid the weakest-link problem? thanks Demai
Re: HDFS ShortCircuit Read on Mac?
Chris, many thanks for the quick response. I will disable short-circuit on my Mac for now. :-) Demai On Tue, Sep 8, 2015 at 4:57 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote: > Hello Demai, > > HDFS short-circuit read currently does not work on Mac, due to some > platform differences in handling of domain sockets. The last time I > checked, our Hadoop code was exceeding a maximum path length enforced on > Mac for domain socket paths. I haven't had availability to look at this in > a while, but the prior work is tracked in JIRA issues HDFS-3296 and > HADOOP-11957 if you want to see the current progress. > > --Chris Nauroth > > From: Demai Ni <nid...@gmail.com> > Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org> > Date: Tuesday, September 8, 2015 at 4:46 PM > To: "user@hadoop.apache.org" <user@hadoop.apache.org> > Subject: HDFS ShortCircuit Read on Mac? > > hi, folks, > > has anyone set up HDFS short-circuit read on a Mac? I installed > hadoop through Homebrew, and it is up and running, but I cannot > configure "dfs.domain.socket.path" as instructed here: > > http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html > since there is no dn_socket on the Mac. > > Any pointers are appreciated. > > Demai
HDFS ShortCircuit Read on Mac?
hi, folks, has anyone set up HDFS short-circuit read on a Mac? I installed hadoop through Homebrew, and it is up and running, but I cannot configure "dfs.domain.socket.path" as instructed here: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html since there is no dn_socket on the Mac. Any pointers are appreciated. Demai
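For reference, the settings that page asks for look like this in hdfs-site.xml (a minimal sketch; the socket path is the Linux example from the docs, and it is exactly this domain-socket path handling that breaks on Mac, per Chris's reply in this thread):

    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>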
hadoop/hdfs cache question, do client processes share cache?
hi, folks, I have a quick question about how HDFS handles caching. In this lab experiment, I have a 4-node Hadoop cluster (2.x), and each node has fairly large memory (96GB). There is a single HDFS file of 256MB, which also fits in one HDFS block. The local filesystem is Linux. Now, from one of the DataNodes, I start 10 hadoop client processes to repeatedly read the above file, with the assumption that HDFS will cache the 256MB in memory, so that (after the 1st read) reads involve no disk I/O anymore. My question is: *how many copies of the 256MB will be in memory on this DataNode, 10 or 1?* And what if the 10 client processes are located on a 5th Linux box independent of the cluster? Will we have 10 copies of the 256MB or just 1? Many thanks. Appreciate your help on this. Demai
Re: hadoop/hdfs cache question, do client processes share cache?
Ritesh, many thanks for your response. I just read through the centralized cache document; thanks for the pointer. A couple of follow-up questions. First, centralized caching requires 'explicit' configuration, so by default there is no HDFS-managed cache? Will caching then occur at the local-filesystem level, as in Linux? The 2nd question: the centralized cache lives on the DNs of HDFS. Let's say the client is a stand-alone Linux box (not part of the cluster) which connects to an HDFS cluster with the centralized cache configured, so on the HDFS cluster the file is cached. In the same scenario as before, the client has 10 processes repeatedly reading the same HDFS file. Will the HDFS client API be able to cache the file content on the client side? Or will every read have to move the whole file through the network, with no sharing between processes? Demai On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Let's assume that hdfs maintains 3 replicas of the 256MB block; then each of these 3 datanodes will have only one copy of the block in its respective memory cache, thus avoiding repeated I/O reads. This goes with the centralized cache management policy of hdfs, which also gives you the option of pinning 2 of these 3 blocks in cache and saving the remaining 256MB of cache space. Here's a link https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html on the same. Hope that helps. Ritesh
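For concreteness, pinning a file into the centralized cache goes through the cacheadmin CLI described in the linked page; a minimal sketch (the pool and path names are illustrative):

    hdfs cacheadmin -addPool demoPool
    hdfs cacheadmin -addDirective -path /user/demai/file256mb -pool demoPool
    hdfs cacheadmin -listDirectives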
Re: a non-commercial distribution of hadoop ecosystem?
Chris and Roman, many thanks for the quick response. I will take a look at Bigtop. Actually, I had heard about it, but thought it was an installation framework rather than a hadoop distribution. Now I am looking at the Bigtop 0.7.0 hadoop instructions, which will probably work fine for my needs. Appreciate the pointer. Roman, I will ping you off list about ODP. I was hoping ODP would be the one for me. Well, in reality it is owned by a few companies, though at least not by ONE company. :-) That is fine with me, as long as ODP is open to being used by others. I am just having trouble finding documentation/installation info for ODP. Maybe I should google harder? :-) Demai On Mon, Jun 1, 2015 at 1:46 PM, Roman Shaposhnik r...@apache.org wrote: On Mon, Jun 1, 2015 at 1:37 PM, Demai Ni nid...@gmail.com wrote: My question is: besides the commercial distributions CDH (Cloudera), HDP (Hortonworks), and others like MapR, IBM... is there a distribution that is NOT owned by a company? I am looking for something simple for cluster configuration/installation of multiple components: hdfs, yarn, zookeeper, hive, hbase, maybe Spark. Surely a well-experienced person (not me) can build a distribution from the Apache releases. Well, I am more interested in building applications on top of it, and hope to find one with these components packed together. Apache Bigtop (CCed) aims at delivering a 100% open and community-driven distribution of big data management technologies around Apache Hadoop. Same as, for example, what Debian is trying to do for Linux. BTW, I don't need the latest releases like the commercial distributions offer. I am also looking into ODP (the Open Data Platform), but that project has been kind of quiet since the initial Feb announcement. Feel free to ping me off list if you want more details on ODP. Thanks, Roman.
Re: Connect c language with HDFS
I would also suggest taking a look at https://issues.apache.org/jira/browse/HDFS-6994. I have been using libhdfs3 for a POC in the past few months and highly recommend it. The only drawback is that libhdfs3 has not been formally committed into hadoop/hdfs yet. If you only want to play with hdfs, using the existing libhdfs library is fine, but if you are looking at some serious development, libhdfs3 has a lot of advantages. On Mon, May 4, 2015 at 3:59 AM, unmesha sreeveni unmeshab...@gmail.com wrote: Thanks. Did it. http://unmeshasreeveni.blogspot.in/2015/05/hadoop-word-count-using-c-hadoop.html On Mon, May 4, 2015 at 3:43 PM, Alexander Alten-Lorenz wget.n...@gmail.com wrote: That depends on the installation source (rpm, tgz or parcels). Usually, when you use parcels, libhdfs.so* should be within /opt/cloudera/parcels/CDH/lib64/ (or similar). Or just use Linux's locate (locate libhdfs.so*) to find the library. -- Alexander Alten-Lorenz m: wget.n...@gmail.com b: mapredit.blogspot.com On May 4, 2015, at 11:39 AM, unmesha sreeveni unmeshab...@gmail.com wrote: Thanks Alex. I have gone through the same, but when I checked my cloudera distribution I was not able to find those folders. That's why I posted here. I don't know if I made any mistake. On Mon, May 4, 2015 at 2:40 PM, Alexander Alten-Lorenz wget.n...@gmail.com wrote: Google: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html On May 4, 2015, at 10:57 AM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi. Can we connect C with HDFS using the cloudera hadoop distribution? -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
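As a starting point with the existing libhdfs, a minimal C sketch of connecting and reading (the file path is illustrative, "default" picks up fs.defaultFS from the client configuration, and the build/runtime setup is per the LibHdfs docs; error handling omitted):

    #include "hdfs.h"
    #include <fcntl.h>
    #include <stdio.h>

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);  /* uses fs.defaultFS */
        hdfsFile f = hdfsOpenFile(fs, "/tmp/test.txt", O_RDONLY, 0, 0, 0);
        char buf[4096];
        tSize n = hdfsRead(fs, f, buf, sizeof(buf)); /* read first 4 KB */
        printf("read %d bytes\n", (int) n);
        hdfsCloseFile(fs, f);
        hdfsDisconnect(fs);
        return 0;
    }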
Re: Data locality
hi, folks, I have a similar question. Is there an easy way to tell (from a user perspective) whether short-circuit is enabled? thanks Demai On Mon, Mar 2, 2015 at 11:46 AM, Fei Hu hufe...@gmail.com wrote: Hi All, I developed a scheduler for data locality. Now I want to test the performance of the scheduler, so I need to monitor how much data is read remotely. Is there any tool for monitoring the volume of data moved around the cluster? Thanks, Fei
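One quick client-side check is the getconf tool, which prints what the client's own config files say (a sketch; note it reflects the client-side xml, not the datanodes' live settings):

    hdfs getconf -confKey dfs.client.read.shortcircuit
    hdfs getconf -confKey dfs.domain.socket.path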
[HDFS] result order of getFileBlockLocations() and listFiles()?
hi, Guys, I am trying to implement a simple program (experimental, not for production). It invokes FileSystem.listFiles() to get the list of files under an hdfs folder, and then uses FileSystem.getFileBlockLocations() to get the replica locations of each file's blocks. Since it is a controlled environment, I can make sure the files are static, and I don't worry about datanode crashes, fail-over, etc. Assume that within a small time window (say, 1 minute), 100s~1000s of clients invoke the same program to look up the same folder. Will the above two APIs guarantee the *same result in the same order* for all clients? To elaborate a bit more, say there is a folder called /dfs/dn/user/data containing three files: file1, file2, and file3. If client1 gets: listFiles(): file1, file2, file3; getFileBlockLocations(file1) - datanode1, datanode3, datanode6. Will all other clients get the same information (I think so) and in the same order? Or do I have to sort in each client to guarantee the order? Many thanks for your inputs Demai
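A sketch of the defensive variant (assumption: rather than relying on any server-side ordering guarantee, each client sorts the paths and replica hosts itself, so all clients agree regardless of what the NameNode returns; the folder path is illustrative):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class StableListing {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            RemoteIterator<LocatedFileStatus> it =
                fs.listFiles(new Path("/dfs/dn/user/data"), false);
            List<LocatedFileStatus> files = new ArrayList<>();
            while (it.hasNext()) files.add(it.next());
            // Sort by path so every client sees the same file order.
            files.sort((a, b) -> a.getPath().compareTo(b.getPath()));
            for (LocatedFileStatus f : files) {
                for (BlockLocation loc : f.getBlockLocations()) {
                    String[] hosts = loc.getHosts().clone();
                    Arrays.sort(hosts); // same replica order everywhere
                    System.out.println(f.getPath() + " @" + loc.getOffset()
                        + " -> " + Arrays.toString(hosts));
                }
            }
        }
    }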
Re: read from a hdfs file on the same host as client
Shivram, many thanks for confirming the behavior. I will also turn on short-circuit as you suggested. Appreciate the help. Demai On Mon, Oct 13, 2014 at 3:42 PM, Shivram Mani sm...@pivotal.io wrote: Demai, you are right. HDFS's default BlockPlacementPolicyDefault makes sure one replica of your block is available on the writer's datanode. The replica selection for the read operation is also aimed at minimizing bandwidth/latency, and will serve the block from the reader's local node. If you want to optimize this further, you can set 'dfs.client.read.shortcircuit' to true. This allows the client to bypass the datanode and read the block file directly. On Mon, Oct 13, 2014 at 11:58 AM, Demai Ni nid...@gmail.com wrote: hi, folks, a very simple question; looking forward to a couple of pointers. Let's say I have an hdfs file, testfile, which has only one block (256MB), and the block has a replica on datanode host1.hdfs.com (the whole hdfs cluster may have 100 nodes, though, and the other 2 replicas are on other datanodes). If, on host1.hdfs.com, I do a hadoop fs -cat testfile or use a java client to read the file, should I assume there won't be any significant data movement through the network? That is, is the namenode smart enough to give me the data on host1.hdfs.com directly? thanks Demai -- Thanks Shivram
hdfs: a C API call to getFileSize() through libhdfs or libhdfs3?
hi, folks, to get the size of an hdfs file, the Java API has FileSystem#getFileStatus(PATH)#getLen(). Now I am trying to use a C client to do the same thing. For a file on the local file system, I can grab the info like this: fseeko(file, 0, SEEK_END); size = ftello(file); But I can't find SEEK_END or a getFileSize() call in the existing libhdfs or the new libhdfs3 (https://issues.apache.org/jira/browse/HDFS-6994). Can someone point me in the right direction? many thanks Demai
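In libhdfs the file length comes back through hdfsGetPathInfo rather than a seek; a minimal sketch (the file path is illustrative):

    #include "hdfs.h"
    #include <stdio.h>

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);
        /* hdfsFileInfo carries the length in its mSize field */
        hdfsFileInfo *info = hdfsGetPathInfo(fs, "/tmp/test.txt");
        if (info) {
            printf("size = %lld bytes\n", (long long) info->mSize);
            hdfsFreeFileInfo(info, 1);
        }
        hdfsDisconnect(fs);
        return 0;
    }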
Re: Planning to propose Hadoop initiative to company. Need some inputs please.
hi, glad to see another person moving from the mainframe world to the 'big' data one. I was in the same boat a few years back, after working on mainframes for 10+ years. Wilm covered the pointers already; I'd just like to chime in a bit from the mainframe side. The website-usage example is a very good one for bigdata compared to the mainframe, as the mainframe is very expensive at providing reliability for mission-critical workloads. One approach is to look at the applications currently running on the mainframe, or those you are considering implementing on it. For a website-usage case, the cost to implement and run on hadoop/hbase would be only 1/10 of the mainframe cost, and the mainframe would probably not be able to scale up if the data grows to TBs. 2nd, be careful: Hadoop is not for all your cases. I am pretty sure that your IT department is handling some mission-critical workloads, like payroll, employee info, customer payments, etc. Leave those workloads on the mainframe: 1) hbase/hadoop are not designed for such RDBMS workloads, and 2) moving from one database to another is way too much risk, unless the top boss forces you to do so... :-) Demai On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher wilm.schumac...@cawoom.com wrote: Hi, first: I think hbase is what you are looking for. If I understand correctly, you want to show the customers their data very fast and let them manipulate their data. So you need something like a data warehouse system. Thus, hbase is the method of choice for you (and I think, for your kind of data, hbase is a better choice than cassandra or mongoDB). But of course you need a running hadoop system to run hbase, so it's not an either/or ;) (my answers are for hbase, as I think it's what you are looking for. If you are not interested, just ignore the following text. Sry @all for writing about hbase on this list ;).) On 01.10.2014 at 17:24, mani kandan wrote: 1) How much web usage data will a typical website like ours collect on a daily basis? (I know I can ask our IT department, but I would like to gather some background before talking to them.) Well, if you have the option to ask your IT department, you should do that, because everyone here would have to guess, and you would have to explain in great detail what you plan to do before we could even guess. If you e.g. want to track what a user has clicked on, perhaps to make personalized ads, then you have to save more data. So you should ask the people who have the data right away, without guessing. 3) How many clusters/nodes would I need to run a web usage analytics system? In the book "HBase in Action" there are recommendations for some case studies (part IV, deploying hbase), including thoughts on the number of nodes and how to use them, depending on the size of your data. 4) What are the ways for me to use our data? (One use case I'm thinking of is to analyze the error-message log for each page of the quote process to redesign the UI. Is this possible?) Sure, and this should be very easy. I would pump the error log into an hbase table. That way you could read the messages directly from the hbase shell (if they are few enough), or you could use hive to query your log a little more SQL-like and produce statistics very easily. 5) How long would it take for me to set up and start such a system? For a novice doing it for the first time: for the stand-alone hbase system, perhaps 2 hours. For a completely distributed test cluster ... perhaps a day. For the real production system, with all security features ...
a little longer ;). I'm sorry if some/all of these questions are unanswerable. I just want to discuss my thoughts and get an idea of what I can achieve by going the Hadoop way. Well, I think (but I could err) that you imagine you could just change the database backend from SQL to hbase/hadoop and everything would run right away. It will not be that easy. You would have to change the code of your web application in a very fundamental way, and rethink all the table designs etc., so this could be more complicated than you think right now. However, hbase/hadoop has some advantages which are very interesting for you. First, it is distributed, which enables your company to grow almost limitlessly, or to collect more data about your customers so you can extract more information (and sell more stuff). And MapReduce is a wonderful tool for producing really fancy statistics, which is very interesting for an insurance company. Your mathematical economists will REALLY love it ;). Hope this helped. best wishes Wilm
Re: conf.get("dfs.data.dir") returns null when hdfs-site.xml doesn't set it explicitly
Susheel actually brought up a good point: once the client code connects to the cluster, is there a way to get the real cluster configuration variables/values instead of relying on the .xml files on the client side? Demai On Mon, Sep 8, 2014 at 10:12 PM, Susheel Kumar Gadalay skgada...@gmail.com wrote: One doubt on building the Configuration object. I have a Hadoop remote client and a Hadoop cluster. When a client submits an MR job, the Configuration object is built from the Hadoop cluster node xml files, basically the resource manager node's core-site.xml, mapred-site.xml and yarn-site.xml. Am I correct? TIA Susheel Kumar On 9/9/14, Bhooshan Mogal bhooshan.mo...@gmail.com wrote: Hi Demai, conf = new Configuration() will create a new Configuration object and only add the properties from core-default.xml and core-site.xml to the conf object. This is basically a new configuration object, not the same one that the daemons in the hadoop cluster use. I think what you are trying to ask is whether you can get the Configuration object that a daemon in your live cluster (e.g. a datanode) is using. I am not sure the datanode or any other daemon on a hadoop cluster exposes such an API. I would in fact be tempted to get this information from the configuration management daemon instead - in your case Cloudera Manager. But I am not sure if CM exposes that API either. You could probably find out on the Cloudera mailing list. HTH, Bhooshan On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni nid...@gmail.com wrote: hi, Bhooshan, thanks for your kind response. I ran the code on one of the data nodes of my cluster, with only one hadoop daemon running. I believe my java client code connects to the cluster correctly, as I am able to retrieve fileStatus, list files under a particular hdfs path, and do similar things... However, you are right that the daemon process uses the hdfs-site.xml under another folder for cloudera: /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml. About retrieving the info from a live cluster: I would like to get the information beyond the configuration files (that is, beyond the .xml files). Since I am able to use conf = new Configuration() to connect to hdfs and do other operations, shouldn't I be able to retrieve the configuration variables? Thanks Demai On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal bhooshan.mo...@gmail.com wrote: Hi Demai, When you read a property from the conf object, it will only have a value if the conf object contains that property. In your case, you created the conf object as new Configuration() -- which adds core-default.xml and core-site.xml. Then you added site xmls (hdfs-site.xml and core-site.xml) from specific locations. If none of these files defines dfs.data.dir, you will get NULL. This is expected behavior. What do you mean by retrieving the info from a live cluster? Even for processes like the datanode and namenode, the source of truth for these properties is hdfs-site.xml; it is loaded from a specific location when you start those services. Question: where are you running the above code? Is it on a node which has other hadoop daemons as well? My guess is that the path you are referring to (/etc/hadoop/conf.cloudera.hdfs/core-site.xml) is not the right path where these config properties are defined. Since this is a CDH cluster, you would probably be best served by asking on the CDH mailing list as to where the right path to these files is. HTH, Bhooshan On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni nid...@gmail.com wrote: hi, experts, I am trying to get the local filesystem directory of the data node. My cluster is using CDH5.x (hadoop 2.3) and the default configuration, so the datanode storage is under file:///dfs/dn; I didn't specify the value in hdfs-site.xml. My code is something like:

    conf = new Configuration();
    // tested both with and without the following two lines
    conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));
    // I also tried get("dfs.datanode.data.dir"), which also returns NULL
    String dnDir = conf.get("dfs.data.dir"); // returns NULL

It looks like get() only looks at the configuration files instead of retrieving the info from the live cluster? Many thanks for your help in advance. Demai -- Bhooshan -- Bhooshan
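To make the thread's point concrete, a minimal sketch of the client-side lookup under discussion (the resource path and class name are illustrative; dfs.datanode.data.dir is the 2.x key, with dfs.data.dir its deprecated 1.x alias, and the get() overload with a default makes the miss visible):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class DataDirProbe {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml only.
            Configuration conf = new Configuration();
            // Pull in the client-side copy of hdfs-site.xml explicitly.
            conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
            // Returns the supplied default when neither file sets the key.
            String dnDirs = conf.get("dfs.datanode.data.dir",
                                     "(not set in client-side xml)");
            System.out.println("dfs.datanode.data.dir = " + dnDirs);
        }
    }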
Re: conf.get("dfs.data.dir") returns null when hdfs-site.xml doesn't set it explicitly
hi, Bhooshan, thanks for your kind response. I ran the code on one of the data nodes of my cluster, with only one hadoop daemon running. I believe my java client code connects to the cluster correctly, as I am able to retrieve fileStatus, list files under a particular hdfs path, and do similar things... However, you are right that the daemon process uses the hdfs-site.xml under another folder for cloudera: /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml. About retrieving the info from a live cluster: I would like to get the information beyond the configuration files (that is, beyond the .xml files). Since I am able to use conf = new Configuration() to connect to hdfs and do other operations, shouldn't I be able to retrieve the configuration variables? Thanks Demai On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal bhooshan.mo...@gmail.com wrote: Hi Demai, When you read a property from the conf object, it will only have a value if the conf object contains that property. In your case, you created the conf object as new Configuration() -- which adds core-default.xml and core-site.xml. Then you added site xmls (hdfs-site.xml and core-site.xml) from specific locations. If none of these files defines dfs.data.dir, you will get NULL. This is expected behavior. What do you mean by retrieving the info from a live cluster? Even for processes like the datanode and namenode, the source of truth for these properties is hdfs-site.xml; it is loaded from a specific location when you start those services. Question: where are you running the above code? Is it on a node which has other hadoop daemons as well? My guess is that the path you are referring to (/etc/hadoop/conf.cloudera.hdfs/core-site.xml) is not the right path where these config properties are defined. Since this is a CDH cluster, you would probably be best served by asking on the CDH mailing list as to where the right path to these files is. HTH, Bhooshan On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni nid...@gmail.com wrote: hi, experts, I am trying to get the local filesystem directory of the data node. My cluster is using CDH5.x (hadoop 2.3) and the default configuration, so the datanode storage is under file:///dfs/dn; I didn't specify the value in hdfs-site.xml. My code is something like:

    conf = new Configuration();
    // tested both with and without the following two lines
    conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));
    // I also tried get("dfs.datanode.data.dir"), which also returns NULL
    String dnDir = conf.get("dfs.data.dir"); // returns NULL

It looks like get() only looks at the configuration files instead of retrieving the info from the live cluster? Many thanks for your help in advance. Demai -- Bhooshan
Re: conf.get("dfs.data.dir") returns null when hdfs-site.xml doesn't set it explicitly
Bhooshan, many thanks. I appreciate the help. I will also try the Cloudera mailing list/community. Demai On Mon, Sep 8, 2014 at 4:58 PM, Bhooshan Mogal bhooshan.mo...@gmail.com wrote: Hi Demai, conf = new Configuration() will create a new Configuration object and only add the properties from core-default.xml and core-site.xml to the conf object. This is basically a new configuration object, not the same one that the daemons in the hadoop cluster use. I think what you are trying to ask is whether you can get the Configuration object that a daemon in your live cluster (e.g. a datanode) is using. I am not sure the datanode or any other daemon on a hadoop cluster exposes such an API. I would in fact be tempted to get this information from the configuration management daemon instead - in your case Cloudera Manager. But I am not sure if CM exposes that API either. You could probably find out on the Cloudera mailing list. HTH, Bhooshan On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni nid...@gmail.com wrote: hi, Bhooshan, thanks for your kind response. I ran the code on one of the data nodes of my cluster, with only one hadoop daemon running. I believe my java client code connects to the cluster correctly, as I am able to retrieve fileStatus, list files under a particular hdfs path, and do similar things... However, you are right that the daemon process uses the hdfs-site.xml under another folder for cloudera: /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml. About retrieving the info from a live cluster: I would like to get the information beyond the configuration files (that is, beyond the .xml files). Since I am able to use conf = new Configuration() to connect to hdfs and do other operations, shouldn't I be able to retrieve the configuration variables? Thanks Demai On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal bhooshan.mo...@gmail.com wrote: Hi Demai, When you read a property from the conf object, it will only have a value if the conf object contains that property. In your case, you created the conf object as new Configuration() -- which adds core-default.xml and core-site.xml. Then you added site xmls (hdfs-site.xml and core-site.xml) from specific locations. If none of these files defines dfs.data.dir, you will get NULL. This is expected behavior. What do you mean by retrieving the info from a live cluster? Even for processes like the datanode and namenode, the source of truth for these properties is hdfs-site.xml; it is loaded from a specific location when you start those services. Question: where are you running the above code? Is it on a node which has other hadoop daemons as well? My guess is that the path you are referring to (/etc/hadoop/conf.cloudera.hdfs/core-site.xml) is not the right path where these config properties are defined. Since this is a CDH cluster, you would probably be best served by asking on the CDH mailing list as to where the right path to these files is. HTH, Bhooshan On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni nid...@gmail.com wrote: hi, experts, I am trying to get the local filesystem directory of the data node. My cluster is using CDH5.x (hadoop 2.3) and the default configuration, so the datanode storage is under file:///dfs/dn; I didn't specify the value in hdfs-site.xml. My code is something like:

    conf = new Configuration();
    // tested both with and without the following two lines
    conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));
    // I also tried get("dfs.datanode.data.dir"), which also returns NULL
    String dnDir = conf.get("dfs.data.dir"); // returns NULL

It looks like get() only looks at the configuration files instead of retrieving the info from the live cluster? Many thanks for your help in advance. Demai -- Bhooshan -- Bhooshan
Re: question about matching java API with libHDFS
hi, Yi A, thanks for your response. I took a look at hdfs.h and hdfs.c; it seems the lib only exposes some of the APIs, as there are a lot of other public methods that can be accessed through the java API/client but are not implemented in libhdfs, such as the one I am using now: DFSClient.getNamenode().getBlockLocations(...). Is libhdfs designed to limit the access? Thanks Demai On Thu, Sep 4, 2014 at 2:36 AM, Liu, Yi A yi.a@intel.com wrote: You could refer to the header file "src/main/native/libhdfs/hdfs.h" to see the APIs in detail. Regards, Yi Liu *From:* Demai Ni [mailto:nid...@gmail.com] *Sent:* Thursday, September 04, 2014 5:21 AM *To:* user@hadoop.apache.org *Subject:* question about matching java API with libHDFS hi, folks, I am currently using java to access HDFS. For example, I am using the API DFSClient.getNamenode().getBlockLocations(...) to retrieve file block information. Now I need to move the same logic into C/C++, so I am looking at libHDFS and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also using hdfs_test.c for reference. However, I couldn't find a way to easily figure out whether the above Java API is exposed through libHDFS. Probably not, since I couldn't find it. That leads to my next question: is there an easy way to plug into the libHDFS framework to include additional APIs? thanks a lot for your suggestions Demai
question about matching java API with libHDFS
hi, folks, I am currently using java to access HDFS. For example, I am using the API DFSClient.getNamenode().getBlockLocations(...) to retrieve file block information. Now I need to move the same logic into C/C++, so I am looking at libHDFS and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also using hdfs_test.c for reference. However, I couldn't find a way to easily figure out whether the above Java API is exposed through libHDFS. Probably not, since I couldn't find it. That leads to my next question: is there an easy way to plug into the libHDFS framework to include additional APIs? thanks a lot for your suggestions Demai
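For what it's worth, the closest API libhdfs does expose for block locations is hdfsGetHosts; a minimal sketch (the path and byte range are illustrative; it returns replica hostnames per block, though not the internal block IDs):

    #include "hdfs.h"
    #include <stdio.h>

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);
        /* hosts[b][r] is the r-th replica host of block b; both
           dimensions are NULL-terminated. */
        char ***hosts = hdfsGetHosts(fs, "/tmp/test.txt",
                                     0, 256 * 1024 * 1024);
        if (hosts) {
            for (int b = 0; hosts[b]; b++)
                for (int r = 0; hosts[b][r]; r++)
                    printf("block %d replica on %s\n", b, hosts[b][r]);
            hdfsFreeHosts(hosts);
        }
        hdfsDisconnect(fs);
        return 0;
    }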
Re: Local file system to access hdfs blocks
Stanley, thanks. BTW, I found the JIRA HDFS-2246, which probably matches what I am looking for. Demai on the run On Aug 28, 2014, at 11:34 PM, Stanley Shi s...@pivotal.io wrote: BP-13-7914115-10.122.195.197-14909166276345 is the blockpool information and blk_1073742025 is the block name; these names are private to the HDFS system, and users should not use them, right? But if you really want to know this, you can check the fsck code to see whether they are available. On Fri, Aug 29, 2014 at 8:13 AM, Demai Ni nid...@gmail.com wrote: Stanley and all, thanks. I will write a client application to explore this path. A quick question again: using the fsck command, I can retrieve all the necessary info: $ hadoop fsck /tmp/list2.txt -files -blocks -racks . BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025 len=8 repl=2 [/default/10.122.195.198:50010, /default/10.122.195.196:50010] However, using getFileBlockLocations(), I can't get the block name/id info such as BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025; it seems BlockLocation doesn't expose that info publicly: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/BlockLocation.html Is there another entry point, something fsck is using? thanks Demai On Wed, Aug 27, 2014 at 11:09 PM, Stanley Shi s...@pivotal.io wrote: As far as I know, there's no combination of hadoop APIs that can do that. You can easily get the location of the block (on which DN), but there's no way to get the local address of that block file. On Thu, Aug 28, 2014 at 11:54 AM, Demai Ni nid...@gmail.com wrote: Yehia, no problem at all. I really appreciate your willingness to help. Yeah, now I am able to get such information in two steps: the first step is either hadoop fsck or getFileBlockLocations(), and then I search the local filesystem; my cluster is using the default from CDH, which is /dfs/dn. I would like to do it programmatically, so I am wondering whether someone has already done it, or, maybe better, whether a hadoop API call is already implemented for this exact purpose. Demai On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater y.z.elsha...@gmail.com wrote: Hi Demai, sorry, I missed that you had already tried this out. I think you can construct the block location on the local file system if you have the block pool id and the block id. If you are using the cloudera distribution, the default location is under /dfs/dn (the value of the dfs.data.dir / dfs.datanode.data.dir configuration keys). Thanks Yehia On 27 August 2014 21:20, Yehia Elshater y.z.elsha...@gmail.com wrote: Hi Demai, you can use the fsck utility like the following: hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks This will display all the information you need about the blocks of your file. Hope it helps. Yehia On 27 August 2014 20:18, Demai Ni nid...@gmail.com wrote: Hi, Stanley, many thanks. Your method works. For now, I have a two-step approach: 1) getFileBlockLocations() to grab the hdfs BlockLocation[]; 2) a local file system call (like the find command) to match the block to files on the local file system. Maybe there is an existing Hadoop API that returns such info already? Demai on the run On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote: I am not sure this is what you want, but you can try this shell command: find [DATANODE_DIR] -name [blockname] On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote: Hi, folks, I am new in this area; hoping to get a couple of pointers. I am using Centos and have Hadoop set up using cdh5.1 (Hadoop 2.3). I am wondering whether there is an interface to get each hdfs block's information in terms of the local file system. For example, I can use hadoop fsck /tmp/test.txt -files -blocks -racks to get the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01, /rack/hdfs02...] With such info, is there a way to log in to hdfs01 and read the block directly at the local file system level? Thanks Demai on the run -- Regards, Stanley Shi,
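A shell sketch of the two-step approach that emerged in this thread (the block name and datanode directory are illustrative; the block files live under whatever dfs.datanode.data.dir is set to on that node):

    hadoop fsck /tmp/test.txt -files -blocks -racks
    # then, on a datanode holding a replica:
    find /dfs/dn -name 'blk_1073742025*'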
Re: Local file system to access hdfs blocks
Stanley and all, thanks. I will write a client application to explore this path. A quick question again: using the fsck command, I can retrieve all the necessary info: $ hadoop fsck /tmp/list2.txt -files -blocks -racks . *BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025* len=8 repl=2 [/default/10.122.195.198:50010, /default/10.122.195.196:50010] However, using getFileBlockLocations(), I can't get the block name/id info such as *BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025*; it seems BlockLocation doesn't expose that info publicly: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/BlockLocation.html Is there another entry point, something fsck is using? thanks Demai On Wed, Aug 27, 2014 at 11:09 PM, Stanley Shi s...@pivotal.io wrote: As far as I know, there's no combination of hadoop APIs that can do that. You can easily get the location of the block (on which DN), but there's no way to get the local address of that block file. On Thu, Aug 28, 2014 at 11:54 AM, Demai Ni nid...@gmail.com wrote: Yehia, no problem at all. I really appreciate your willingness to help. Yeah, now I am able to get such information in two steps: the first step is either hadoop fsck or getFileBlockLocations(), and then I search the local filesystem; my cluster is using the default from CDH, which is /dfs/dn. I would like to do it programmatically, so I am wondering whether someone has already done it, or, maybe better, whether a hadoop API call is already implemented for this exact purpose. Demai On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater y.z.elsha...@gmail.com wrote: Hi Demai, sorry, I missed that you had already tried this out. I think you can construct the block location on the local file system if you have the block pool id and the block id. If you are using the cloudera distribution, the default location is under /dfs/dn (the value of the dfs.data.dir / dfs.datanode.data.dir configuration keys). Thanks Yehia On 27 August 2014 21:20, Yehia Elshater y.z.elsha...@gmail.com wrote: Hi Demai, you can use the fsck utility like the following: hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks This will display all the information you need about the blocks of your file. Hope it helps. Yehia On 27 August 2014 20:18, Demai Ni nid...@gmail.com wrote: Hi, Stanley, many thanks. Your method works. For now, I have a two-step approach: 1) getFileBlockLocations() to grab the hdfs BlockLocation[]; 2) a local file system call (like the find command) to match the block to files on the local file system. Maybe there is an existing Hadoop API that returns such info already? Demai on the run On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote: I am not sure this is what you want, but you can try this shell command: find [DATANODE_DIR] -name [blockname] On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote: Hi, folks, I am new in this area; hoping to get a couple of pointers. I am using Centos and have Hadoop set up using cdh5.1 (Hadoop 2.3). I am wondering whether there is an interface to get each hdfs block's information in terms of the local file system. For example, I can use hadoop fsck /tmp/test.txt -files -blocks -racks to get the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01, /rack/hdfs02...] With such info, is there a way to log in to hdfs01 and read the block directly at the local file system level? Thanks Demai on the run -- Regards, *Stanley Shi,*
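To make the gap concrete, a minimal sketch of what the public getFileBlockLocations() API does return (the file name is illustrative): offsets, lengths, and hosts are available, but not the blk_... IDs that fsck prints:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/tmp/list2.txt"));
            for (BlockLocation loc :
                     fs.getFileBlockLocations(st, 0, st.getLen())) {
                // No block pool or block id here -- only placement info.
                System.out.println("offset=" + loc.getOffset()
                    + " len=" + loc.getLength()
                    + " hosts=" + Arrays.toString(loc.getHosts()));
            }
        }
    }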
Re: Local file system to access hdfs blocks
Hi, Stanley, many thanks. Your method works. For now, I have a two-step approach: 1) getFileBlockLocations() to grab the hdfs BlockLocation[]; 2) a local file system call (like the find command) to match the block to files on the local file system. Maybe there is an existing Hadoop API that returns such info already? Demai on the run On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote: I am not sure this is what you want, but you can try this shell command: find [DATANODE_DIR] -name [blockname] On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote: Hi, folks, I am new in this area; hoping to get a couple of pointers. I am using Centos and have Hadoop set up using cdh5.1 (Hadoop 2.3). I am wondering whether there is an interface to get each hdfs block's information in terms of the local file system. For example, I can use hadoop fsck /tmp/test.txt -files -blocks -racks to get the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01, /rack/hdfs02...] With such info, is there a way to log in to hdfs01 and read the block directly at the local file system level? Thanks Demai on the run -- Regards, Stanley Shi,
Re: Local file system to access hdfs blocks
Yehia, no problem at all. I really appreciate your willingness to help. Yeah, now I am able to get such information in two steps: the first step is either hadoop fsck or getFileBlockLocations(), and then I search the local filesystem; my cluster is using the default from CDH, which is /dfs/dn. I would like to do it programmatically, so I am wondering whether someone has already done it, or, maybe better, whether a hadoop API call is already implemented for this exact purpose. Demai On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater y.z.elsha...@gmail.com wrote: Hi Demai, sorry, I missed that you had already tried this out. I think you can construct the block location on the local file system if you have the block pool id and the block id. If you are using the cloudera distribution, the default location is under /dfs/dn (the value of the dfs.data.dir / dfs.datanode.data.dir configuration keys). Thanks Yehia On 27 August 2014 21:20, Yehia Elshater y.z.elsha...@gmail.com wrote: Hi Demai, you can use the fsck utility like the following: hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks This will display all the information you need about the blocks of your file. Hope it helps. Yehia On 27 August 2014 20:18, Demai Ni nid...@gmail.com wrote: Hi, Stanley, many thanks. Your method works. For now, I have a two-step approach: 1) getFileBlockLocations() to grab the hdfs BlockLocation[]; 2) a local file system call (like the find command) to match the block to files on the local file system. Maybe there is an existing Hadoop API that returns such info already? Demai on the run On Aug 26, 2014, at 9:14 PM, Stanley Shi s...@pivotal.io wrote: I am not sure this is what you want, but you can try this shell command: find [DATANODE_DIR] -name [blockname] On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni nid...@gmail.com wrote: Hi, folks, I am new in this area; hoping to get a couple of pointers. I am using Centos and have Hadoop set up using cdh5.1 (Hadoop 2.3). I am wondering whether there is an interface to get each hdfs block's information in terms of the local file system. For example, I can use hadoop fsck /tmp/test.txt -files -blocks -racks to get the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01, /rack/hdfs02...] With such info, is there a way to log in to hdfs01 and read the block directly at the local file system level? Thanks Demai on the run -- Regards, *Stanley Shi,*
Local file system to access hdfs blocks
Hi, folks, I am new in this area; hoping to get a couple of pointers. I am using Centos and have Hadoop set up using cdh5.1 (Hadoop 2.3). I am wondering whether there is an interface to get each hdfs block's information in terms of the local file system. For example, I can use hadoop fsck /tmp/test.txt -files -blocks -racks to get the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01, /rack/hdfs02...] With such info, is there a way to log in to hdfs01 and read the block directly at the local file system level? Thanks Demai on the run