- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
--
*From:* Dhruba Borthakur dhr...@gmail.com
*To:* hdfs-user@hadoop.apache.org; Andrew Purtell apurt...@apache.org
*Sent:* Thursday, September 15, 2011 10:14 AM
distro. Would you object if I make a contribution of that
result if it is successful?
Best regards,
- Andy
From: Dhruba Borthakur dhr...@gmail.com
To: Andrew
We use HDFS RAID in a big way. Data older than 12 days is RAIDed using XOR
encoding (effective replication of 2.5). Data older than a few months is
RAIDed using Reed-Solomon (effective observed replication factor of 1.5).
This has been running on our 60 PB cluster for about a year now.
thanks
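As an aside, the effective replication factors quoted above follow directly from the stripe geometry. A minimal sketch, where the stripe lengths, parity counts, and replication levels are illustrative assumptions, not the actual cluster settings:

```python
# Effective replication of a RAIDed stripe: total physical block copies
# stored divided by logical data blocks. The geometry and replication
# levels below are illustrative assumptions, not Facebook's real config.

def effective_replication(stripe_len, parity_len, data_rep, parity_rep):
    total_copies = stripe_len * data_rep + parity_len * parity_rep
    return total_copies / stripe_len

xor = effective_replication(10, 1, 2, 2)   # 2.2 with this geometry
rs = effective_replication(10, 4, 1, 1)    # 1.4 with this geometry
```

The quoted 2.5 and 1.5 figures correspond to whatever geometry HDFS RAID was actually configured with; the formula is the same either way.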
My answers inline.
1. Why does namenode store the blockmap (block to datanode mapping) in the
main memory for all the files, even those that are not used?
The block to datanode mapping is needed for two reasons: when a client wants
to read a file, the namenode has to tell the client the
We are using hdfs for backups (and archival) of a huge number of databases
Thanks
Dhruba
Sent from my iPhone
On Jul 16, 2011, at 9:14 AM, Owen O'Malley o...@hortonworks.com wrote:
The scientists at CERN use HDFS for storing their large data sets and
don't use MapReduce at all. (I believe
The AvatarNode does use zookeeper (but since this is not directly related to
Apache HDFS code, if you have more questions, please send them to me directly).
The latest AvatarNode code is in
We, at FB, are saving about 5 PB of raw space on a cluster that has a total
raw disk space of 30 PB. Please remember that the more blocks your files
have on average, the larger the disk-space savings.
thanks,
dhruba
On Mon, Feb 21, 2011 at 4:29 PM, Nathan Rutman nrut...@gmail.com wrote:
The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is
a very rough calculation.
dhruba
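Those per-file and per-block figures make for a quick back-of-the-envelope heap estimate. A sketch (the file count and blocks-per-file inputs below are made up):

```python
# Rough Namenode heap estimate from the ~160 bytes/file and
# ~150 bytes/block figures above. The inputs are made-up examples.

def namenode_heap_bytes(num_files, avg_blocks_per_file,
                        bytes_per_file=160, bytes_per_block=150):
    num_blocks = num_files * avg_blocks_per_file
    return num_files * bytes_per_file + num_blocks * bytes_per_block

# e.g. 10 million files averaging 1.5 blocks each: ~3.6 GiB of heap
gib = namenode_heap_bytes(10_000_000, 1.5) / 2**30
```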
On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay chinm...@qualcomm.com wrote:
What you describe is pretty much my use case as well. Since I don’t know
how big the number of files
each packet has an offset in the file that it is supposed to be written to.
So, there is no harm in resending the same packet twice; the receiving
datanode will always write the packet to the correct offset in the
destination file.
If B crashes during the write, the client does not know whether
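The idempotency argument can be illustrated with plain positional writes: writing the same bytes to the same offset twice leaves the file unchanged. A sketch using POSIX-style `os.pwrite` (the packet contents and offset are made up):

```python
# Why resending a packet is harmless: a write that targets an explicit
# offset is idempotent. This is an analogy with local files, not HDFS code.
import os
import tempfile

fd, path = tempfile.mkstemp()
packet, offset = b"packet-data", 4
os.pwrite(fd, b"\0" * (offset + len(packet)), 0)  # preallocate the region
os.pwrite(fd, packet, offset)   # first send
os.pwrite(fd, packet, offset)   # duplicate resend: same bytes, same offset
with open(path, "rb") as f:
    data = f.read()
os.close(fd)
os.remove(path)
```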
Hi folks,
Is it necessary to keep the clocks synchronized on all HBase region
servers and the master? I would appreciate it a lot if somebody could
explain whether the HBase architecture depends on this.
thanks,
dhruba
On Wed, Dec 23, 2009 at 9:57 AM, Mark Vigeant
I recommend that the first command you run after all daemons are formatted
and started is to create your home directory (before you upload files):
$ hadoop dfs -ls
ls: Cannot access .: No such file or directory.
$ hadoop dfs -mkdir /user/ninput
$ hadoop dfs -copyFromLocal .vimrc .
$ hadoop
strictly copy
blocks/files from the decommissioned node to the other live nodes? Or do
blocks get copied from other live nodes too?
i.e., is the source of transfers always the node being decommissioned?
-Harold
--- On Tue, 9/15/09, Dhruba Borthakur dhr...@gmail.com wrote:
From: Dhruba Borthakur
This might help:
http://wiki.apache.org/hadoop/FAQ#17
https://issues.apache.org/jira/browse/HADOOP-681
Bandwidth can be throttled on a datanode via dfs.balance.bandwidthPerSec
thanks,
dhruba
On Tue, Sep 15, 2009 at 1:55 PM, Harold Lim rold...@yahoo.com wrote:
Hi All,
Is there a document
We cannot put GPL code into Hive... licenses are incompatible.
You can make it a dynamically configurable parameter. If the relevant
classes are in the CLASSPATH, then they will be invoked. Otherwise, the stubs
(built into Hive) can throw an exception. A customer can download the
maxmind stuff into
You can use any of these:
1. bin/hadoop dfs -get hdfsfile remote filename
2. Thrift API : http://wiki.apache.org/hadoop/HDFS-APIs
3. use fuse-mount to mount hdfs as a regular file system on a remote machine:
http://wiki.apache.org/hadoop/MountableHDFS
thanks,
dhruba
On Mon, Apr 20, 2009 at
I attached a HiveLet (Made up term)
That's a cool name!
I believe that file modification times are updated only when the file is
closed. Are you appending to a preexisting file?
thanks,
dhruba
On Tue, Dec 30, 2008 at 3:14 AM, Sandeep Dhawan dsand...@hcl.in wrote:
Hi,
I have an application which creates a simple text file on hdfs. There is a
Hi Dennis,
There were some discussions on this topic earlier:
http://issues.apache.org/jira/browse/HADOOP-3799
Do you have any specific use-case for this feature?
thanks,
dhruba
On Mon, Nov 24, 2008 at 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
On Nov 24, 2008, at 8:44 PM, Mahadev
The design is such that running multiple secondary namenodes should not
corrupt the image (modulo any bugs). Are you seeing image corruptions when
this happens?
You can run all or any daemons in 32-bit mode or 64-bit mode. You can
mix and match. If you have many millions of files, then you might
One can open a file and then seek to an offset and then start reading
from there. For writing, one can write only to the end of an existing
file using FileSystem.append().
hope this helps,
dhruba
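The same access model can be mimicked with local files as an analogy (this is ordinary Python file I/O, not the HDFS API): reads may start at any offset, while writes go only at the end.

```python
# Local-file analogy for the HDFS access model: seek to any offset to
# read, but write only at the end, like FileSystem.append().
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo")
with open(path, "wb") as f:
    f.write(b"hello world")

with open(path, "rb") as f:     # open, seek to an offset, read from there
    f.seek(6)
    tail = f.read()             # b"world"

with open(path, "ab") as f:     # append-only: new bytes land at the end
    f.write(b"!")

with open(path, "rb") as f:
    final = f.read()            # b"hello world!"
```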
On Thu, Nov 13, 2008 at 1:24 PM, Tsz Wo (Nicholas), Sze
[EMAIL PROTECTED] wrote:
Append is going to
Couple of things that one can do:
1. dfs.name.dir should have at least two locations, one on the local
disk and one on NFS. This means that all transactions are
synchronously logged into two places.
2. Create a virtual IP, say name.xx.com that points to the real
machine name of the machine on
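For the first point, a hadoop-site.xml fragment might look like this (both paths are placeholders):

```xml
<!-- Illustrative only: substitute your own local and NFS paths. -->
<property>
  <name>dfs.name.dir</name>
  <value>/local/disk/hdfs/name,/mnt/nfs/hdfs/name</value>
</property>
```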
services, it's quite a common case.
I think HBase developers would have run into similar issues as well.
Is this enough explanation?
Thanks in advance,
Taeho
On Tue, Nov 4, 2008 at 3:12 AM, Dhruba Borthakur [EMAIL PROTECTED] wrote:
In the current code, details about block locations
It can return 0 if and only if the requested size was zero. For EOF,
it should return -1.
dhruba
On Fri, Nov 7, 2008 at 8:09 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:
Just want to ensure 0 iff EOF or the requested #of bytes was 0.
On 11/7/08 6:13 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:
Hi Ben,
And, if I may add, if you would like to contribute the code to make this
happen, that will be awesome! In that case, we can move this discussion to a
JIRA.
Thanks,
dhruba
On 10/27/08 1:41 PM, Ashish Thusoo [EMAIL PROTECTED] wrote:
We did have some discussions around it a while back
My opinion is to not store file-namespace related metadata on the
datanodes. When a file is renamed, one has to contact all datanodes to
change this new metadata. Worse still, if one renames an entire
subdirectory, all blocks that belong to all files in the subdirectory
have to be updated.
In almost all hadoop configurations, all host names can be specified
as IP addresses. So, in your hadoop-site.xml, please specify the IP
address of the namenode (instead of its hostname).
-dhruba
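For example (the IP address and port below are placeholders):

```xml
<!-- Placeholder address: substitute your namenode's IP. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.0.2.10:9000</value>
</property>
```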
2008/8/8 Lucas Nazário dos Santos [EMAIL PROTECTED]:
Thanks Andreas. I'll try it.
On Fri, Aug 8,
It is possible that your namenode is overloaded and is not able to
respond to RPC requests from clients. Please check the namenode logs
to see if you see lines of the form "discarding calls".
dhruba
On Fri, Aug 8, 2008 at 3:41 AM, Alexander Aristov
[EMAIL PROTECTED] wrote:
I come across the
wrote:
Dhruba Borthakur wrote:
A good way to implement failover is to make the Namenode log transactions
to more than one directory, typically a local directory and an NFS-mounted
directory. The Namenode writes transactions to both directories
synchronously.
If the Namenode machine dies
One option for you is to use a pdf-to-text converter (many of them are
available online) and then run map-reduce on the txt file.
-dhruba
On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
[EMAIL PROTECTED] wrote:
Thanks Lohit, I am using only the default reader and I am very new to hadoop.
This is my map
HDFS uses the network topology to distribute and replicate data. An
admin has to configure a script that describes the network topology to
HDFS. This is specified by using the parameter
topology.script.file.name in the Configuration file. This has been
tested when nodes are on different subnets in
This "firstbadlink" was a misconfigured log message in the code. It
is innocuous and has since been fixed in the 0.17 release.
http://issues.apache.org/jira/browse/HADOOP-3029
thanks,
dhruba
On Sat, May 24, 2008 at 7:03 PM, C G [EMAIL PROTECTED] wrote:
Hi All:
So far, running 0.16.4 has been a
There isn't a way to change the block size of an existing file. The
block size of a file can be specified only at the time of file
creation and cannot be changed later.
There isn't any wasted space in your system. If the block size is
128MB but you create a HDFS file of say size 10MB, then that
Did one datanode fail or did the namenode fail? By fail do you mean
that the system was rebooted or was there a bad disk that caused the
problem?
thanks,
dhruba
On Sun, May 11, 2008 at 7:23 PM, C G [EMAIL PROTECTED] wrote:
Hi All:
We had a primary node failure over the weekend. When we
You bring up an interesting point. A big chunk of the code in the
Namenode is being done inside a global lock although there are pieces
(e.g. a portion of code that chooses datanodes for a newly allocated
block) that do execute outside this lock. But, it is probably the case
that the namenode does
: 3.0
The filesystem under path '/' is CORRUPT
So it seems like it's fixing some problems on its own?
Thanks,
C G
Dhruba Borthakur [EMAIL PROTECTED] wrote:
Did one datanode fail or did the namenode fail? By fail do you mean
that the system was rebooted or was there a bad disk
Starting in 0.17 release, an application can invoke
DFSOutputStream.fsync() to persist block locations for a file even
before the file is closed.
thanks,
dhruba
On Tue, May 6, 2008 at 8:11 AM, Cagdas Gerede [EMAIL PROTECTED] wrote:
If you are writing 10 blocks for a file and let's say in 10th
From: Cagdas Gerede [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 29, 2008 11:32 PM
To: core-user@hadoop.apache.org
Cc: dhruba Borthakur
Subject: Block reports: memory vs. file system, and Dividing
offerService into 2 threads
Currently,
dhruba Borthakur wrote:
My current thinking is that block report processing should compare the
blkxxx files on disk with the data structure in the Datanode memory. If
and only if there is some discrepancy between these two
of Datanodes is large.
-dhruba
From: Cagdas Gerede [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 24, 2008 11:56 AM
To: dhruba Borthakur
Cc: core-user@hadoop.apache.org
Subject: Re: Please Help: Namenode Safemode
Hi Dhruba,
Thanks for your answer. But I
As far as I know, you need Cygwin to install and run hadoop. The fact
that you are using Cygwin to run hadoop has almost negligible impact on
the performance and efficiency of the hadoop cluster. Cygwin is mostly
needed for the install and configuration scripts. There are a few small
portions of
The DFSClient caches small packets (e.g. 64K write buffers) and they are
lazily flushed to the datanodes in the pipeline. So, when an application
completes an out.write() call, it is definitely not guaranteed that the
data has been sent to even one datanode.
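A local analogy of this buffering behavior (ordinary Python buffered I/O, not the DFSClient itself): a completed write() call does not mean the bytes have reached the underlying file.

```python
# With a buffered stream, write() returning says nothing about the bytes
# having reached storage; they sit in the client-side buffer until flushed.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "buffered")
f = open(path, "wb", buffering=64 * 1024)    # 64K buffer, like the packet cache
f.write(b"x" * 100)                          # write() completes...
size_before_flush = os.path.getsize(path)    # ...but the file is still empty
f.flush()
size_after_flush = os.path.getsize(path)     # now the 100 bytes are there
f.close()
```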
One option would be to retrieve cache hints
The DFSClient has a thread that renews leases periodically for all files
that are being written to. I suspect that this thread is not getting a
chance to run because the gunzip program is eating all the CPU. You
might want to insert a sleep() every few seconds while unzipping.
Thanks,
dhruba
mean, what is the configuration parameter dfs.secondary.http.address for?
Unless there are plans to make this interface work, this config parameter
should go away, and so should the listening thread, shouldn't they?
Thanks,
-Yuri
On Friday 04 April 2008 03:30:46 pm dhruba Borthakur wrote:
Your configuration is good. The secondary Namenode does not publish a
web interface. The null pointer message in the secondary Namenode log
is a harmless bug but should be fixed. It would be nice if you can open
a JIRA for it.
Thanks,
Dhruba
-Original Message-
From: Yuri Pradkin
The namenode lazily instructs a Datanode to delete blocks. As a response to
every heartbeat from a Datanode, the Namenode instructs it to delete a maximum
of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat
thread in the Datanode deletes the block files synchronously
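Those two numbers bound the deletion rate at roughly 33 blocks per second per datanode. A quick sketch:

```python
# Back-of-the-envelope deletion throughput: at most 100 block deletions
# per heartbeat reply, one heartbeat every 3 seconds.
import math

def time_to_delete(num_blocks, blocks_per_heartbeat=100, heartbeat_secs=3):
    heartbeats = math.ceil(num_blocks / blocks_per_heartbeat)
    return heartbeats * heartbeat_secs

# Purging 100,000 blocks takes 3000 seconds, i.e. 50 minutes.
secs = time_to_delete(100_000)
```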
is my other question.
If two different clients ordered move to trash with different interval,
(e.g. client #1 with fs.trash.interval = 60; client #2 with
fs.trash.interval = 120)
what would happen?
Does namenode keep track of all these info?
/Taeho
On 3/20/08, dhruba Borthakur [EMAIL PROTECTED
HDFS files, once created, cannot be modified in any way. Appends to HDFS
files will probably be supported in a future release in the next couple
of months.
Thanks,
dhruba
-Original Message-
From: Cagdas Gerede [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 18, 2008 9:53 AM
To:
Your procedure is right:
1. Copy edit.tmp from secondary to edit on primary
2. Copy srcimage from secondary to fsimage on primary
3. remove edits.new on primary
4. restart cluster, put in Safemode, fsck /
However, the above steps are not foolproof because the transactions that
occurred between
HDFS can be accessed using the FileSystem API:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html
The HDFS Namenode protocol can be found in:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNode.html
thanks,
dhruba
-Original
I agree with Joydeep. For batch processing, it is sufficient to make the
application not assume that HDFS is always up and active. However, for
real-time applications that are not batch-centric, it might not be
sufficient. There are a few things that HDFS could do to better handle
Namenode
The Namenode maintains a lease for every open file that is being written
to. If the client that was writing to the file disappears, the Namenode
will do lease recovery after expiry of the lease timeout (1 hour). The
lease recovery process (in most cases) will remove the last block from
the file
If your file system metadata is in /tmp, then you are likely to see
these kinds of problems. It would be nice if you can move the location
of your metadata files away from /tmp. If you still see the problem, can
you please send us the logs from the log directory?
Thanks a bunch,
Dhruba