Re: Copying files from HDFS to remote database

2009-04-21 Thread Dhruba Borthakur
You can use any of these: 1. bin/hadoop dfs -get hdfsfile 2. Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs 3. use fuse-mount to mount hdfs as a regular file system on a remote machine: http://wiki.apache.org/hadoop/MountableHDFS thanks, dhruba On Mon, Apr 20, 2009 at 9:40 PM, Parul K
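
A minimal Java sketch of the programmatic equivalent of option 1 above ("bin/hadoop dfs -get"), using the org.apache.hadoop.fs.FileSystem API; the namenode URI and both paths are placeholders, not values from the original thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode address; normally picked up from hadoop-site.xml.
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
            FileSystem fs = FileSystem.get(conf);
            // Copy an HDFS file to the local file system, like "bin/hadoop dfs -get".
            fs.copyToLocalFile(new Path("/user/foo/hdfsfile"), new Path("/tmp/hdfsfile"));
            fs.close();
        }
    }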

Re: File Modification timestamp

2009-01-08 Thread Dhruba Borthakur
, 2008 at 11:35 PM, Sandeep Dhawan wrote: > > Hi Dhruba, > > The file is being closed properly but the timestamp does not get modified. > The modification timestamp > still shows the file creation time. > I am creating a new file and writing data into this file. > > Tha

Re: File Modification timestamp

2008-12-30 Thread Dhruba Borthakur
I believe that file modification times are updated only when the file is closed. Are you "appending" to a preexisting file? thanks, dhruba On Tue, Dec 30, 2008 at 3:14 AM, Sandeep Dhawan wrote: > > Hi, > > I have a application which creates a simple text file on hdfs. There is a > second appli

Re: 64 bit namenode and secondary namenode & 32 bit datanod

2008-11-25 Thread Dhruba Borthakur
The design is such that running multiple secondary namenodes should not corrupt the image (modulo any bugs). Are you seeing image corruptions when this happens? You can run any or all daemons in 32-bit mode or 64-bit mode, and you can mix and match. If you have many millions of files, then you might w

Re: Block placement in HDFS

2008-11-25 Thread Dhruba Borthakur
Hi Dennis, There were some discussions on this topic earlier: http://issues.apache.org/jira/browse/HADOOP-3799 Do you have any specific use-case for this feature? thanks, dhruba On Mon, Nov 24, 2008 at 10:22 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote: > > On Nov 24, 2008, at 8:44 PM, Mahadev

Re: Anything like RandomAccessFile in Hadoop FS ?

2008-11-13 Thread Dhruba Borthakur
One can open a file and then seek to an offset and then start reading from there. For writing, one can write only to the end of an existing file using FileSystem.append(). hope this helps, dhruba On Thu, Nov 13, 2008 at 1:24 PM, Tsz Wo (Nicholas), Sze <[EMAIL PROTECTED]> wrote: > Append is going
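
A short sketch of both calls described above (seek for reads, append for writes), assuming the FileSystem API of that era; the path, offset, and data are illustrative only, and append() is only usable on releases/configurations that support it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekAndAppend {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/foo/data.txt");   // placeholder path

            // Random-access style read: seek to an offset, then read from there.
            FSDataInputStream in = fs.open(p);
            in.seek(1024L);
            byte[] buf = new byte[4096];
            int n = in.read(buf, 0, buf.length);
            in.close();

            // Writes can only go to the end of an existing file.
            FSDataOutputStream out = fs.append(p);
            out.write("more data\n".getBytes());
            out.close();
        }
    }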

Re: Best way to handle namespace host failures

2008-11-10 Thread Dhruba Borthakur
Couple of things that one can do: 1. dfs.name.dir should have at least two locations, one on the local disk and one on NFS. This means that all transactions are synchronously logged into two places. 2. Create a virtual IP, say name.xx.com that points to the real machine name of the machine on whi
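
The dfs.name.dir setting normally lives in hadoop-site.xml; below is a hedged sketch of the same idea expressed through the Configuration API, with purely illustrative local-disk and NFS paths:

    import org.apache.hadoop.conf.Configuration;

    public class NameDirConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Two comma-separated directories: every namespace transaction is
            // logged synchronously to both (local disk plus an NFS mount).
            conf.set("dfs.name.dir", "/data/dfs/name,/mnt/nfs/dfs/name");
            System.out.println(conf.get("dfs.name.dir"));
        }
    }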

Re: Can FSDataInputStream.read return 0 bytes and if so, what does that mean?

2008-11-08 Thread Dhruba Borthakur
It can return 0 if and only if the requested size was zero. For EOF, it should return -1. dhruba On Fri, Nov 7, 2008 at 8:09 PM, Pete Wyckoff <[EMAIL PROTECTED]> wrote: > Just want to ensure 0 iff EOF or the requested #of bytes was 0. > > On 11/7/08 6:13 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wro
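
A small read loop illustrating that contract (0 is returned only when 0 bytes were requested, -1 signals EOF); the path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadLoop {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/user/foo/part-00000"));
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            int n;
            // read() returns -1 at end of file; it can return 0 only if we asked for 0 bytes.
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                total += n;
            }
            in.close();
            System.out.println("read " + total + " bytes");
        }
    }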

Re: Question on opening file info from namenode in DFSClient

2008-11-08 Thread Dhruba Borthakur
, this is a rare case, but for online > services, it's quite a common case. > I think HBase developers would have run into similar issues as well. > > Is this enough explanation? > > Thanks in advance, > > Taeho > > > > On Tue, Nov 4, 2008 at 3:12 AM, Dhruba

Re: Question on opening file info from namenode in DFSClient

2008-11-03 Thread Dhruba Borthakur
In the current code, details about block locations of a file are cached on the client when the file is opened. This cache remains with the client until the file is closed. If the same file is re-opened by the same DFSClient, it re-contacts the namenode and refetches the block locations. This works

Re: [hive-users] Hive Roadmap (Some information)

2008-10-27 Thread Dhruba Borthakur
Hi Ben, And, if I may add, if you would like to contribute the code to make this happen, that will be awesome! In that case, we can move this discussion to a JIRA. Thanks, dhruba On 10/27/08 1:41 PM, "Ashish Thusoo" <[EMAIL PROTECTED]> wrote: We did have some discussions around it a while ba

Re: Thinking about retriving DFS metadata from datanodes!!!

2008-09-11 Thread Dhruba Borthakur
My opinion is to not store file-namespace related metadata on the datanodes. When a file is renamed, one has to contact all datanodes to update this metadata. Worse still, if one renames an entire subdirectory, all blocks that belong to all files in the subdirectory have to be updated. Similar

Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-07 Thread Dhruba Borthakur
The DFS errors might have been caused by http://issues.apache.org/jira/browse/HADOOP-4040 thanks, dhruba On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote: > These exceptions are apparently coming from the dfs side of things. Could > someone from the dfs side please look at t

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-09 Thread Dhruba Borthakur
In almost all hadoop configurations, all host names can be specified as IP addresses. So, in your hadoop-site.xml, please specify the IP address of the namenode (instead of its hostname). -dhruba 2008/8/8 Lucas Nazário dos Santos <[EMAIL PROTECTED]>: > Thanks Andreas. I'll try it. > > > On Fri, Aug
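
A hedged sketch of pointing a client at the namenode by IP address instead of hostname; the address and port are examples, and in practice this setting would live in hadoop-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ConnectByIp {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Use the namenode's IP address rather than its hostname.
            conf.set("fs.default.name", "hdfs://192.0.2.10:9000");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("connected to " + fs.getUri());
            fs.close();
        }
    }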

Re: What will happen if two processes writes the same HDFS file

2008-08-09 Thread Dhruba Borthakur
When the first one contacts the namenode to open the file for writing, the namenode records this info in a "lease". When the second process contacts the namenode to open the same file for writing, the namenode sees that a "lease" already exists for the file and rejects the request from the second
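
A hedged sketch of what the rejected second writer sees. The thread is about two separate processes; this single-process version is only for illustration, and since the exact exception class varies by release (typically an AlreadyBeingCreatedException), plain IOException is caught:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TwoWriters {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/foo/only-one-writer.txt");   // placeholder

            FSDataOutputStream first = fs.create(p);   // namenode records a lease
            try {
                // A second open-for-write on the same path is rejected while
                // the first writer's lease is still valid.
                FSDataOutputStream second = fs.create(p);
                second.close();
            } catch (IOException expected) {
                System.out.println("second writer rejected: " + expected.getMessage());
            }
            first.close();
        }
    }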

Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Dhruba Borthakur
It is possible that your namenode is overloaded and is not able to respond to RPC requests from clients. Please check the namenode logs to see if you see lines of the form "discarding calls...". dhruba On Fri, Aug 8, 2008 at 3:41 AM, Alexander Aristov <[EMAIL PROTECTED]> wrote: > I come across the

Re: NameNode failover procedure

2008-07-30 Thread Dhruba Borthakur
one and when with secondary namenode process ? > > > Andrzej Bialecki wrote: >> >> Dhruba Borthakur wrote: >>> A good way to implement failover is to make the Namenode log transactions >>> to >>> more than one directory, typically a local directory and a NFS m

Re: Text search on a PDF file using hadoop

2008-07-23 Thread Dhruba Borthakur
One option for you is to use a pdf-to-text converter (many of them are available online) and then run map-reduce on the txt file. -dhruba On Wed, Jul 23, 2008 at 1:07 AM, GaneshG <[EMAIL PROTECTED]> wrote: > > Thanks Lohit, i am using only defalult reader and i am very new to hadoop. > This is my

Re: All datanodes getting marked as dead

2008-06-18 Thread Dhruba Borthakur
You are running out of file handles on the namenode. When this happens, the namenode cannot receive heartbeats from datanodes because these heartbeats arrive on a tcp/ip socket connection and the namenode does not have any free file descriptors to accept these socket connections. Your data is stil

Re: data locality in HDFS

2008-06-18 Thread Dhruba Borthakur
HDFS uses the network topology to distribute and replicate data. An admin has to configure a script that describes the network topology to HDFS. This is specified by using the parameter "topology.script.file.name" in the Configuration file. This has been tested when nodes are on different subnets i

Re: Maximum number of files in hadoop

2008-06-08 Thread Dhruba Borthakur
The maximum number of files in HDFS depends on the amount of memory available for the namenode. Each file object and each block object takes about 150 bytes of memory. Thus, if you have 10 million files, each with one block, then you would need about 3GB of memory for the namenode.
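
The rule of thumb above in a few lines of Java; 150 bytes per object is the approximation from the post, and the file count is just an example value to plug in:

    public class NamenodeHeapEstimate {
        public static void main(String[] args) {
            long files = 10000000L;       // 10 million files (example value)
            long objectsPerFile = 2;      // one file object + one block object
            long bytesPerObject = 150;    // rough per-object cost from the post
            long heapBytes = files * objectsPerFile * bytesPerObject;
            System.out.println("approx namenode heap: "
                    + (heapBytes / (1024 * 1024)) + " MB");   // roughly 3 GB
        }
    }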

Re: HDFS Question re adding additional storage

2008-05-29 Thread Dhruba Borthakur
e way to force the rebalancing operation > > thanks, > -prasana > > Dhruba Borthakur wrote: >> >> What that means is that the new nodes will be relatively empty >> till new data arrives into the cluster. It might take a while for the new >> nodes to get fil

Re: "firstbadlink is/as" messages in 0.16.4

2008-05-24 Thread Dhruba Borthakur
This "firstbadlink" was a misconfigured log message in the code. It is innocuous and has since been fixed in the 0.17 release. http://issues.apache.org/jira/browse/HADOOP-3029 thanks, dhruba On Sat, May 24, 2008 at 7:03 PM, C G <[EMAIL PROTECTED]> wrote: > Hi All: > > So far, running 0.16.4 has be

Re: 0.16.4 DFS dropping blocks, then won't restart...

2008-05-23 Thread Dhruba Borthakur
If you look at the log message starting with "STARTUP_MSG: build =..." you will see that the namenode and good datanode was built by CG whereas the bad datanodes were compiled by hadoopqa! thanks, dhruba On Fri, May 23, 2008 at 9:01 AM, C G <[EMAIL PROTECTED]> wrote: > 2008-05-23 11:53:25,377 I

Re: dfs.block.size vs avg block size

2008-05-18 Thread Dhruba Borthakur
There isn't a way to change the block size of an existing file. The block size of a file can be specified only at the time of file creation and cannot be changed later. There isn't any wasted space in your system. If the block size is 128MB but you create an HDFS file of say size 10MB, then that fi
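
A sketch of choosing a per-file block size at create time, the only point where it can be set; the path, block size, replication, and buffer size are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/foo/big-blocks.dat");
            // Block size is fixed at creation: 256 MB here, replication 3,
            // 64 KB write buffer. It cannot be changed afterwards.
            FSDataOutputStream out =
                fs.create(p, true, 64 * 1024, (short) 3, 256L * 1024 * 1024);
            out.write(new byte[]{1, 2, 3});
            out.close();
            fs.close();
        }
    }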

Re: java.io.IOException: Could not obtain block / java.io.IOException: Could not get block locations

2008-05-16 Thread Dhruba Borthakur
What version of java are you using? How many threads are you running on the namenode? How many cores do your machines have? thanks, dhruba On Fri, May 16, 2008 at 6:02 AM, André Martin <[EMAIL PROTECTED]> wrote: > Hi Hadoopers, > we are experiencing a lot of "Could not obtain block / Could not g

Re: HDFS corrupt...how to proceed?

2008-05-11 Thread Dhruba Borthakur
0 (0.0 %) > Target replication factor: 3 > Real replication factor: 3.0 > > > The filesystem under path '/' is CORRUPT > > So it seems like it's fixing some problems on its own? > > Thanks, > C G > > > Dhruba Borthakur <[EMA

Re: Read timed out, Abandoning block blk_-5476242061384228962

2008-05-11 Thread Dhruba Borthakur
You bring up an interesting point. A big chunk of the Namenode code runs inside a global lock, although there are pieces (e.g. a portion of code that chooses datanodes for a newly allocated block) that do execute outside this lock. But, it is probably the case that the namenode does

Re: HDFS corrupt...how to proceed?

2008-05-11 Thread Dhruba Borthakur
Did one datanode fail or did the namenode fail? By "fail" do you mean that the system was rebooted or was there a bad disk that caused the problem? thanks, dhruba On Sun, May 11, 2008 at 7:23 PM, C G <[EMAIL PROTECTED]> wrote: > Hi All: > > We had a primary node failure over the weekend. When

Re: HDFS: fault tolerance to block losses with namenode failure

2008-05-06 Thread Dhruba Borthakur
Starting in the 0.17 release, an application can invoke DFSOutputStream.fsync() to persist block locations for a file even before the file is closed. thanks, dhruba On Tue, May 6, 2008 at 8:11 AM, Cagdas Gerede <[EMAIL PROTECTED]> wrote: > If you are writing 10 blocks for a file and let's say in 10t
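
A hedged sketch of persisting data before close. The post names DFSOutputStream.fsync() in 0.17; the same capability later surfaced as sync() (and still later hflush()) on FSDataOutputStream, so the sync() call below is an assumption about the 0.19/0.20-era API rather than the exact 0.17 method name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteAndSync {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/user/foo/journal.log"));
            out.write("record 1\n".getBytes());
            // Ask HDFS to persist what has been written so far (including the
            // block locations on the namenode) even though the file is still open.
            out.sync();
            out.write("record 2\n".getBytes());
            out.close();
            fs.close();
        }
    }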

RE: Block reports: memory vs. file system, and Dividing offerService into 2 threads

2008-04-30 Thread dhruba Borthakur
ock reports: memory vs. file system, and Dividing offerService into 2 threads dhruba Borthakur wrote: > My current thinking is that "block report processing" should compare the > blkxxx files on disk with the data structure in the Datanode memory. If > and only if there is some dis

RE: Block reports: memory vs. file system, and Dividing offerService into 2 threads

2008-04-30 Thread dhruba Borthakur
s, dhruba From: Cagdas Gerede [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 29, 2008 11:32 PM To: core-user@hadoop.apache.org Cc: dhruba Borthakur Subject: Block reports: memory vs. file system, and Dividing offerService into 2 threads Currently, Blo

RE: Please Help: Namenode Safemode

2008-04-24 Thread dhruba Borthakur
of Datanodes is large. -dhruba From: Cagdas Gerede [mailto:[EMAIL PROTECTED] Sent: Thursday, April 24, 2008 11:56 AM To: dhruba Borthakur Cc: core-user@hadoop.apache.org Subject: Re: Please Help: Namenode Safemode Hi Dhruba, Thanks for your answer. But I

RE: Please Help: Namenode Safemode

2008-04-23 Thread dhruba Borthakur
: Cagdas Gerede [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 23, 2008 4:37 PM To: core-user@hadoop.apache.org Cc: dhruba Borthakur Subject: Please Help: Namenode Safemode I have a hadoop distributed file system with 3 datanodes. I only have 150 blocks in each datanode. It takes a little more

RE: regarding a query on the support of hadoop on windows.

2008-04-22 Thread dhruba Borthakur
As far as I know, you need Cygwin to install and run hadoop. The fact that you are using Cygwin to run hadoop has almost negligible impact on the performance and efficiency of the hadoop cluster. Cygwin is mostly needed for the install and configuration scripts. There are a few small portions of cl

RE: datanode files list

2008-04-21 Thread dhruba Borthakur
You should be able to run "bin/hadoop fsck -files -blocks -locations /" and get a listing of all files and the datanode(s) that each block of the file resides in. Thanks, dhruba -Original Message- From: Shimi K [mailto:[EMAIL PROTECTED] Sent: Monday, April 21, 2008 2:12 AM To: core-user@

RE: Help: When is it safe to discard a block in the application layer

2008-04-17 Thread dhruba Borthakur
The DFSClient caches small packets (e.g. 64K write buffers) and they are lazily flushed to the datanodes in the pipeline. So, when an application completes an out.write() call, it is definitely not guaranteed that data is sent to even one datanode. One option would be to retrieve cache hints fr

RE: Lease expired on open file

2008-04-16 Thread dhruba Borthakur
The DFSClient has a thread that renews leases periodically for all files that are being written to. I suspect that this thread is not getting a chance to run because the gunzip program is eating all the CPU. You might want to put in a Sleep() every few seconds while unzipping. Thanks, dhruba -

RE: multiple datanodes in the same machine

2008-04-15 Thread dhruba Borthakur
Yes, just point the Datanodes to different config files, different sets of ports, different data directories, etc. Thanks, dhruba -Original Message- From: Cagdas Gerede [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 15, 2008 11:21 AM To: core-user@hadoop.apache.org Subject: multiple

RE: secondary namenode web interface

2008-04-08 Thread dhruba Borthakur
hing useful? I mean, what is the configuration parameter dfs.secondary.http.address for? Unless there are plans to make this interface work, this config parameter should go away, and so should the listening thread, shouldn't they? Thanks, -Yuri On Friday 04 April 2008 03:30:46 pm dh

RE: secondary namenode web interface

2008-04-04 Thread dhruba Borthakur
Your configuration is good. The secondary Namenode does not publish a web interface. The "null pointer" message in the secondary Namenode log is a harmless bug but should be fixed. It would be nice if you can open a JIRA for it. Thanks, Dhruba -Original Message- From: Yuri Pradkin [mailt

RE: Append data in hdfs_write

2008-03-26 Thread dhruba Borthakur
HDFS files, once closed, cannot be reopened for writing. See HADOOP-1700 for more details. Thanks, dhruba -Original Message- From: Raghavendra K [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 26, 2008 11:29 PM To: core-user@hadoop.apache.org Subject: Append data in hdfs_write Hi, I

RE: Performance / cluster scaling question

2008-03-21 Thread dhruba Borthakur
The namenode lazily instructs a Datanode to delete blocks. As a response to every heartbeat from a Datanode, the Namenode instructs it to delete a maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat thread in the Datanode deletes the block files synchronously

RE: HDFS: Flash Application and Available APIs

2008-03-20 Thread dhruba Borthakur
There is a C-language based API to access HDFS. You can find more details at: http://wiki.apache.org/hadoop/LibHDFS If you download the Hadoop source code from http://hadoop.apache.org/core/releases.html, you will see this API in src/c++/libhdfs/hdfs.c hope this helps, dhruba -Original Mess

RE: Trash option in hadoop-site.xml configuration.

2008-03-20 Thread dhruba Borthakur
my another question. If two different clients ordered "move to trash" with different interval, (e.g. client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval = 120) what would happen? Does namenode keep track of all these info? /Taeho On 3/20/08, dhruba Borthakur <[EM

RE: Trash option in hadoop-site.xml configuration.

2008-03-19 Thread dhruba Borthakur
The "trash" feature is a client side option and depends on the client configuration file. If the client's configuration specifies that "Trash" is enabled, then the HDFS client invokes a "rename to Trash" instead of a "delete". Now, if "Trash" is enabled on the Namenode, then the Namenode periodical
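
A sketch of where the client-side switch lives; fs.trash.interval is in minutes, 60 is just an example value, and in practice it would be set in the client's hadoop-site.xml rather than in code:

    import org.apache.hadoop.conf.Configuration;

    public class TrashConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // A non-zero interval (in minutes) makes the HDFS client rename
            // deleted paths into Trash instead of removing them immediately.
            conf.setInt("fs.trash.interval", 60);
            System.out.println(conf.get("fs.trash.interval"));
        }
    }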

RE: HDFS: how to append

2008-03-18 Thread dhruba Borthakur
HDFS files, once created, cannot be modified in any way. Appends to HDFS files will probably be supported in a future release in the next couple of months. Thanks, dhruba -Original Message- From: Cagdas Gerede [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 18, 2008 9:53 AM To: core-user@

RE: Question about recovering from a corrupted namenode 0.16.0

2008-03-13 Thread dhruba Borthakur
Your procedure is right: 1. Copy edit.tmp from secondary to edit on primary 2. Copy srcimage from secondary to fsimage on primary 3. remove edits.new on primary 4. restart cluster, put in Safemode, fsck / However, the above steps are not foolproof because the transactions that occurred between th

RE: HDFS interface

2008-03-11 Thread dhruba Borthakur
HDFS can be accessed using the FileSystem API: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html The HDFS Namenode protocol can be found in http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNode.html thanks, dhruba -Original Messag
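
A minimal sketch of going through the FileSystem API referenced above; the directory listed is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDirectory {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // List a directory through the generic FileSystem interface.
            FileStatus[] entries = fs.listStatus(new Path("/user/foo"));
            for (FileStatus entry : entries) {
                System.out.println(entry.getPath() + "\t" + entry.getLen());
            }
            fs.close();
        }
    }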

RE: zombie data nodes, not alive but not dead

2008-03-10 Thread dhruba Borthakur
The following issues might be impacting you (from release notes) http://issues.apache.org/jira/browse/HADOOP-2185 HADOOP-2185. RPC Server uses any available port if the specified port is zero. Otherwise it uses the specified port. Also combines the configuration attributes for the se

RE: org.apache.hadoop.dfs.NameNode: java.lang.NullPointerException

2008-03-02 Thread dhruba Borthakur
Hi Andre, Is it possible for you to let me look at your entire Namenode log? Thanks, dhruba -Original Message- From: André Martin [mailto:[EMAIL PROTECTED] Sent: Saturday, March 01, 2008 4:32 PM To: core-user@hadoop.apache.org Subject: org.apache.hadoop.dfs.NameNode: java.lang.NullPoint

RE: long write operations and data recovery

2008-02-29 Thread dhruba Borthakur
It would be nice if a layer on top of the dfs client could be built to handle disconnected operation. That layer could cache files on local disk if HDFS is unavailable and then upload those files into HDFS when the HDFS service comes back online. I think such a service would be helpful for most HDFS instal

RE: long write operations and data recovery

2008-02-28 Thread dhruba Borthakur
I agree with Joydeep. For batch processing, it is sufficient to make the application not assume that HDFS is always up and active. However, for real-time applications that are not batch-centric, it might not be sufficient. There are a few things that HDFS could do to better handle Namenode outages:

RE: long write operations and data recovery

2008-02-26 Thread dhruba Borthakur
The Namenode maintains a lease for every open file that is being written to. If the client that was writing to the file disappears, the Namenode will do "lease recovery" after expiry of the lease timeout (1 hour). The lease recovery process (in most cases) will remove the last block from the file (

RE: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread dhruba Borthakur
If your file system metadata is in /tmp, then you are likely to see these kinds of problems. It would be nice if you could move the location of your metadata files away from /tmp. If you still see the problem, can you please send us the logs from the log directory? Thanks a bunch, Dhruba -Original

RE: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread dhruba Borthakur
Reformatting should never be necessary if you are using a released version of hadoop. Hadoop-2783 refers to a bug that got introduced into trunk (not in any released versions). Thanks, Dhruba -Original Message- From: Steve Sapovits [mailto:[EMAIL PROTECTED] Sent: Friday, February 22, 2008

RE: Python access to HDFS

2008-02-21 Thread dhruba Borthakur
Hi Pete, If you are referring to the ability to re-open a file and append to it, then this feature is not in 0.16. Please see: http://issues.apache.org/jira/browse/HADOOP-1700 Thanks, dhruba -Original Message- From: Pete Wyckoff [mailto:[EMAIL PROTECTED] Sent: Thursday, February 21, 200

RE: Namenode fails to replicate file

2008-02-07 Thread dhruba Borthakur
You have to use the -w parameter to the setrep command to make it wait till the replication is complete. The following command bin/hadoop dfs -setrep 10 -w filename will block till all blocks of the file achieves a replication factor of 10. Thanks, dhruba -Original Message- From: Tim Wi
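
The programmatic counterpart of "hadoop dfs -setrep" is FileSystem.setReplication(), which only records the new target. The polling loop below is an illustrative sketch of waiting for the blocks to catch up (it is not how the -w flag is actually implemented), and it assumes the 0.18+ getFileBlockLocations() call; the path and target are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationAndWait {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/foo/important.dat");
            short target = 10;

            fs.setReplication(p, target);   // records the new target replication

            // Poll until every block reports at least "target" replica locations.
            boolean done = false;
            while (!done) {
                FileStatus st = fs.getFileStatus(p);
                BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
                done = true;
                for (BlockLocation b : blocks) {
                    if (b.getHosts().length < target) { done = false; break; }
                }
                if (!done) Thread.sleep(5000);
            }
            System.out.println("replication reached " + target);
            fs.close();
        }
    }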

RE: question aboutc++ libhdfs and a bug comment on the libdfs test case

2008-01-22 Thread dhruba Borthakur
Hi Jason, Good catch. It would be great if you can create a JIRA issue and submit your code change as a patch for this problem. There are some big sites (about 1000 node clusters) that use libhdfs to access HDFS. Thanks, Dhruba -Original Message- From: Jason Venner [mailto:[EMAIL PROTEC