Re: questions regarding hadoop version 1.0

2012-03-14 Thread Joey Echeverria
JobTracker and TaskTracker. YARN is only in 0.23 and later releases. 1.0.x is from the 0.20.x line of releases. -Joey On Mar 14, 2012, at 7:00, arindam choudhury arindam732...@gmail.com wrote: Hi, Hadoop 1.0.1 uses hadoop YARN or the tasktracker, jobtracker model? Regards, Arindam

Re: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Joey Echeverria
Yes you have to deal with the compression. Usually, you'll load the compression codec in your RecordReader. You can see an example of how TextInputFormat's LineRecordReader does it:

Re: setting up a large hadoop cluster

2012-03-12 Thread Joey Echeverria
Apache Bigtop also has Hadoop puppet modules. For the modules based on Hadoop 0.20.205 you can look at them here: https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/ I haven't seen any documentation on the modules. -Joey On Mon, Mar 12, 2012 at 1:43 PM,

Re: What is currently the best way to write to multiple output locations in Hadoop?

2012-03-12 Thread Joey Echeverria
Small typo, try: jar tf hadoop-core-1.0.1.jar | grep -i MultipleOutputs ;) -Joey On Mon, Mar 12, 2012 at 4:56 PM, W.P. McNeill bill...@gmail.com wrote: I take that back. On my laptop I'm running Apache Hadoop 1.0.1, and I still don't see MultipleOutputs. I am building against

Re: Is there a way to get an absolute HDFS path?

2012-03-12 Thread Joey Echeverria
HDFS has the notion of a working directory which defaults to /user/username. Check out: http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/fs/FileSystem.html#getWorkingDirectory() and
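
A minimal sketch of resolving a relative path to an absolute one this way (file names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AbsolutePath {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Relative paths resolve against the working directory,
        // which defaults to /user/<username>.
        Path relative = new Path("data/input.txt");
        Path absolute = new Path(fs.getWorkingDirectory(), relative);
        System.out.println(absolute.toUri().getPath());
      }
    }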

Re: setting up a large hadoop cluster

2012-03-12 Thread Joey Echeverria
Masoud, I know that the Puppet Labs website is confusing, but puppet is open source and has no node limit. You can download it from here: http://puppetlabs.com/misc/download-options/ If you're using a Red Hat compatible linux distribution, you can get RPMs from EPEL:

Re: Best way for setting up a large cluster

2012-03-08 Thread Joey Echeverria
Something like puppet is a good choice. There are example puppet manifests available for most Hadoop-related projects in Apache BigTop, for example: https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/ -Joey On Thu, Mar 8, 2012 at 9:42 PM, Masoud

Re: is there anyway to detect the file size as am i writing a sequence file?

2012-03-06 Thread Joey Echeverria
I think you mean Writer.getLength(). It returns the current position in the output stream in bytes (more or less the current size of the file). -Joey On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne jane.wayne2...@gmail.com wrote: hi, i am writing a little util class to recurse into a directory and
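
A rough sketch of using it to roll output files at a size threshold (paths and the 64 MB cap are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RollingWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long targetBytes = 64L * 1024 * 1024;  // hypothetical size cap
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/part-00000.seq"), Text.class, Text.class);
        for (int i = 0; i < 1000000; i++) {
          writer.append(new Text("key-" + i), new Text("value-" + i));
          if (writer.getLength() >= targetBytes) {  // bytes written so far
            break;  // close this file and open the next one here
          }
        }
        writer.close();
      }
    }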

Re: Adding nodes

2012-03-01 Thread Joey Echeverria
You only have to refresh nodes if you're making use of an allow file (dfs.hosts). Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update

Re: Adding nodes

2012-03-01 Thread Joey Echeverria
PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allow file. Thanks, does it mean that when the tasktracker/datanode starts up it communicates with the namenode using the masters file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia

Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Joey Echeverria
I know this doesn't fix lzo, but have you considered Snappy for the intermediate output compression? It gets similar compression ratios and compress/decompress speed, but arguably has better Hadoop integration. -Joey On Thu, Mar 1, 2012 at 10:01 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Which version of the Hadoop LZO library are you using? It looks like something I'm pretty sure was fixed in a newer version. -Joey On Feb 28, 2012, at 4:58, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I've been running a cluster for over a year and was getting a lzo

Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Try 0.4.15. You can get it from here: https://github.com/toddlipcon/hadoop-lzo Sent from my iPhone On Feb 28, 2012, at 6:49, Marc Sturlese marc.sturl...@gmail.com wrote: I'm with 0.4.9 (think is the latest) -- View this message in context:

Re: dfs.block.size

2012-02-27 Thread Joey Echeverria
dfs.block.size can be set per job. mapred.tasktracker.map.tasks.maximum is per tasktracker. -Joey On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Can someone please suggest if parameters like dfs.block.size, mapred.tasktracker.map.tasks.maximum are only cluster

Re: Security at file level in Hadoop

2012-02-22 Thread Joey Echeverria
HDFS supports POSIX style file and directory permissions (read, write, execute) for the owner, group and world. You can change the permissions with hadoop fs -chmod <permissions> <path> -Joey On Feb 22, 2012, at 5:32, shreya@cognizant.com wrote: Hi I want to implement security at

Re: Backupnode in 1.0.0?

2012-02-22 Thread Joey Echeverria
Check out the Apache Bigtop project. I believe they have 0.22 RPMs. Out of curiosity, why are you interested in BackupNode? -Joey Sent from my iPhone On Feb 22, 2012, at 14:56, Jeremy Hansen jer...@skidrow.la wrote: Any possibility of getting spec files to create packages for 0.22?

Re: Backupnode in 1.0.0?

2012-02-22 Thread Joey Echeverria
don't fully understand. I'll check out Bigtop.  I looked at it a while ago and forgot about it. Thanks -jeremy On Feb 22, 2012, at 2:43 PM, Joey Echeverria wrote: Check out the Apache Bigtop project. I believe they have 0.22 RPMs. Out of curiosity, why are you interested in BackupNode

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Joey Echeverria
I'd recommend making a SequenceFile[1] to store each XML file as a value. -Joey [1] http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/SequenceFile.html On Tue, Feb 21, 2012 at 12:15 PM, Mohit Anchlia mohitanch...@gmail.comwrote: We have small xml files. Currently I am
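
A sketch of the packing step (paths are hypothetical; key = file name, value = raw bytes; uses java.nio Files.readAllBytes, Java 7+):

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackXml {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/xml-packed.seq"),
            Text.class, BytesWritable.class);
        for (File xml : new File("/local/xml-dir").listFiles()) {
          byte[] bytes = Files.readAllBytes(xml.toPath());
          // key = original file name, value = raw file contents
          writer.append(new Text(xml.getName()), new BytesWritable(bytes));
        }
        writer.close();
      }
    }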

Re: Adding mahout math jar to hadoop mapreduce execution

2012-02-01 Thread Joey Echeverria
, I did not even need to specify lib jars in the command line…should I be worried that it doesn't work that way? On Jan 31, 2012, at 4:09 PM, Joey Echeverria wrote: You also need to add the jar to the classpath so it's available in your main. You can do something like

Re: Adding mahout math jar to hadoop mapreduce execution

2012-01-31 Thread Joey Echeverria
You also need to add the jar to the classpath so it's available in your main. You can do something like this: HADOOP_CLASSPATH=/usr/local/mahout/math/target/mahout-math-0.6-SNAPSHOT.jar hadoop jar ... -Joey On Tue, Jan 31, 2012 at 1:38 PM, Daniel Quach danqu...@cs.ucla.edu wrote: For Hadoop

Re: NameNode per-block memory usage?

2012-01-17 Thread Joey Echeverria
How much memory/JVM heap does NameNode use for each block? I don't remember the exact number; it also depends on which version of Hadoop you're using http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? It's 1 GB for every *million* objects (files, blocks, etc.). This is a

Re: Username on Hadoop 20.2

2012-01-16 Thread Joey Echeverria
to the property name? I'm using CDH3 with my Hadoop cluster currently setup with one node in pseudo-distributed mode, in case that helps. Cheers, Eli On 1/12/12 5:39 PM, Joey Echeverria wrote: Set the user.name property in your core-site.xml on your client nodes. -Joey On Thu, Jan 12

Re: Can you unset a mapred.input.dir configuration value?

2012-01-16 Thread Joey Echeverria
You can use FileInputFormat.setInputPaths(configuration, "job1-output"). This will overwrite the old input path(s). -Joey On Mon, Jan 16, 2012 at 7:16 PM, W.P. McNeill bill...@gmail.com wrote: Is it possible to unset a configuration value? I think the answer is no, but I want to be sure. I
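
A sketch of chaining two jobs this way (job and path names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Configuration conf = new Configuration();
    // ... job1 runs first and writes to "job1-output" ...
    Job job2 = new Job(conf, "stage-2");
    // setInputPaths replaces any mapred.input.dir already present in conf,
    // so job1's input paths don't leak into job2
    FileInputFormat.setInputPaths(job2, new Path("job1-output"));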

Re: Access core-site.xml from FileInputFormat

2012-01-12 Thread Joey Echeverria
It doesn't matter if the original comes from mapred-site.xml, core-site.xml, or hdfs-site.xml. All that really matters is if it's a client/job tunable or if it configures one of the daemons. Which parameter did you want to change? On Thu, Jan 12, 2012 at 1:59 PM, Marcel Holle

Re: Username on Hadoop 20.2

2012-01-12 Thread Joey Echeverria
Set the user.name property in your core-site.xml on your client nodes. -Joey On Thu, Jan 12, 2012 at 3:55 PM, Eli Finkelshteyn iefin...@gmail.com wrote: Hi, If I have one username on a hadoop cluster and would like to set myself up to use that same username from every client from which I

Re: Access core-site.xml from FileInputFormat

2012-01-12 Thread Joey Echeverria
to access from the FileInputFormat.getSplits() method. Is this possible? 2012/1/12 Joey Echeverria j...@cloudera.com It doesn't matter if the original comes from mapred-site.xml, core-site.xml, or hdfs-site.xml. All that really matters is if it's a client/job tunable or if it configures one

Re: has bzip2 compression been deprecated?

2012-01-10 Thread Joey Echeverria
Yes. Hive doesn't format data when you load it. The only exception is if you do an INSERT OVERWRITE ... . -Joey On Jan 10, 2012, at 6:08, Tony Burton tbur...@sportingindex.com wrote: Thanks for this Bejoy, very helpful. So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED

Re: Expected file://// error

2012-01-08 Thread Joey Echeverria
What's the classpath of the java program submitting the job? It has to have the configuration directory (e.g. /opt/hadoop/conf) in there or it won't pick up the correct configs. -Joey On Sun, Jan 8, 2012 at 12:59 PM, Mark question markq2...@gmail.com wrote: mapred-site.xml: configuration  

Re: Multi user Hadoop 0.20.205 ?

2011-12-29 Thread Joey Echeverria
Hey Praveenesh, What do you mean by multiuser? Do you want to support multiple users starting/stopping daemons? -Joey On Dec 29, 2011, at 2:49, praveenesh kumar praveen...@gmail.com wrote: Guys, Did someone try this thing ? Thanks On Tue, Dec 27, 2011 at 4:36 PM, praveenesh kumar

Re: Multi user Hadoop 0.20.205 ?

2011-12-29 Thread Joey Echeverria
, Praveenesh On Thu, Dec 29, 2011 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: Hey Praveenesh, What do you mean by multiuser? Do you want to support multiple users starting/stopping daemons? -Joey On Dec 29, 2011, at 2:49, praveenesh kumar praveen...@gmail.com wrote

Re: network configuration (etc/hosts) ?

2011-12-21 Thread Joey Echeverria
Can you run the hostname command on both servers and send their output? -Joey On Tue, Dec 20, 2011 at 8:21 PM, MirrorX mirr...@gmail.com wrote: dear all i am trying for many days to get a simple hadoop cluster (with 2 nodes) to work but i have trouble configuring the network parameters. i

Re: streaming data ingest into HDFS

2011-12-15 Thread Joey Echeverria
You could run the flume collectors on other machines and write a source which connects to the sockets on the data generators. -Joey On Dec 15, 2011, at 21:27, Periya.Data periya.d...@gmail.com wrote: Sorry...misworded my statement. What I meant was that the sources are meant to be

Re: Cloudera Free

2011-12-08 Thread Joey Echeverria
Hi Bai, I'm moving this over to scm-us...@cloudera.org as that's a more appropriate list. (common-user bcced). I assume by Cloudera Free you mean Cloudera Manager Free Edition? You should be able to run a job in the same way that you do on any other Hadoop cluster. The only caveat is that you first

Re: HDFS Backup nodes

2011-12-07 Thread Joey Echeverria
You should also configure the Namenode to use an NFS mount for one of its storage directories. That will give the most up-to-date backup of the metadata in case of total node failure. -Joey On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com wrote: This means still we are

Re: HDFS Backup nodes

2011-12-07 Thread Joey Echeverria
On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote: What happens then if the nfs server fails or isn't reachable? Does hdfs lock up? Does it gracefully ignore the nfs copy? Thanks, randy - Original Message - From: Joey Echeverria j...@cloudera.com To: common-user

Re: Regarding loading a big XML file to HDFS

2011-11-22 Thread Joey Echeverria
If your file is bigger than a block size (typically 64 MB or 128 MB), then it will be split into more than one block. The blocks may or may not be stored on different datanodes. If you're using a default InputFormat, then the input will be split between two tasks. Since you said you need the whole

Re: HBase Stack

2011-11-15 Thread Joey Echeverria
You can certainly run HBase on a single server, but I don't think you'd want to. Very few projects ever reach a scale where a single MySQL server can't handle it. In my opinion, you should start with the easy solution (MySQL) and only bring HBase into the mix when your scale really demands it. If

Re: Slow shuffle stage?

2011-11-11 Thread Joey Echeverria
Another thing to look at is the map outlier. The shuffle will start by default when 5% of the maps are done, but won't finish until after the last map is done. Since one of your maps took 37 minutes, your shuffle will take at least that long. I would check the following: Is the input skewed? Does

Re: Slow shuffle stage?

2011-11-11 Thread Joey Echeverria
remember...I think so though. On Nov 11, 2011, at 5:53 AM, Joey Echeverria wrote: Another thing to look at is the map outlier. The shuffle will start by default when 5% of the maps are done, but won't finish until after the last map is done. Since one of your maps took 37 minutes, your

Re: someone know how to install hadoop0.20 on hp-ux?

2011-11-04 Thread Joey Echeverria
You need to create a log directory on your TaskTracker nodes: /opt/ecip/BMC/hadoopTest/hadoop-0.20.203.0/logs/ Make sure the directory is writable by the mapred user, or which ever user your TaskTrackers were started as. -Joey On Thu, Nov 3, 2011 at 11:11 PM, Li, Yonggang yongga...@hp.com

Re: Hadoop + cygwin

2011-11-03 Thread Joey Echeverria
What are the permissions on \tmp\hadoop-cyg_server\mapred\local\ttprivate? Which user owns that directory? Which user are you starting your TaskTracker as? -Joey On Wed, Nov 2, 2011 at 9:29 PM, Masoud mas...@agape.hanyang.ac.kr wrote: Hi, I'm running hadoop 0.20.204 under cygwin 1.7 on Win7,

Re: map task attempt progress at 400%?

2011-11-03 Thread Joey Echeverria
Is your input data compressed? There have been some bugs in the past with reporting progress when reading compressed data. -Joey On Thu, Nov 3, 2011 at 9:18 AM, Brendan W. bw8...@gmail.com wrote: Hi, Running 0.20.2: A job with about 4000 map tasks quickly blew through all but 3 in a couple

Re: map task attempt progress at 400%?

2011-11-03 Thread Joey Echeverria
compressed lines of text.  So maybe that accounts for the progress report. Any idea what the huge time difference might be due to (2 minutes average vs. 20 hrs for the last 3 tasks)?  Does that sound like swapping to you? Thanks, Brendan On Thu, Nov 3, 2011 at 9:44 AM, Joey Echeverria j

Re: Hadoop 0.20.2 and JobConf deprecation

2011-11-03 Thread Joey Echeverria
A new API was introduced with Hadoop 0.20. However, that API is not feature complete. Despite the fact that the old API is marked as deprecated, it's still the recommended, full-featured API. In fact, in later versions of Hadoop the old API has been undeprecated to call more attention to its stable

Re: Question about superuser and permissions

2011-11-03 Thread Joey Echeverria
When you get the handle to the FileSystem object you can connect as a different user: http://hadoop.apache.org/common/docs/r0.20.203.0/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI, org.apache.hadoop.conf.Configuration, java.lang.String) This should get any permissions you set
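
A sketch of connecting as another user and handing a directory over (URI and user names are hypothetical):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AsOtherUser {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // open the FileSystem as the superuser
        FileSystem fs = FileSystem.get(
            URI.create("hdfs://namenode:8020"), conf, "hdfs");
        fs.mkdirs(new Path("/user/newuser"));
        fs.setOwner(new Path("/user/newuser"), "newuser", "newuser");
      }
    }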

Re: Problem using SCM

2011-11-01 Thread Joey Echeverria
Hi Trang, I'm moving the discuss to scm-us...@cloudera.org as it's not a Hadoop common issue. I've bcced common-user@hadoop.apache.org and also put you in the to: field in case you're not on scm-users. As for your problem, the issue is that SCM doesn't support an installation via sudo if sudo

Re: Default Compression

2011-10-31 Thread Joey Echeverria
Try getting rid of the extra spaces and new lines. -Joey On Mon, Oct 31, 2011 at 1:49 PM, Mark static.void@gmail.com wrote: I recently added the following to my core-site.xml <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.DefaultCodec,

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-16 Thread Joey Echeverria
lack of understanding about hadoop. :-) Jessica On Wed, Oct 5, 2011 at 4:27 PM, Jessica Owensby jessica.owen...@gmail.comwrote: Great.  Thanks!  Will give that a try. Jessica On Wed, Oct 5, 2011 at 4:22 PM, Joey Echeverria j...@cloudera.com wrote: It sounds like you're hitting

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-05 Thread Joey Echeverria
Did you add the LZO codec configuration to core-site.xml? -Joey On Wed, Oct 5, 2011 at 2:31 PM, Jessica Owensby jessica.owen...@gmail.com wrote: Hello Everyone, I've been having an issue in a hadoop environment (running cdh3u1) where any table declared in hive with the STORED AS INPUTFORMAT

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-05 Thread Joey Echeverria
Are your LZO files indexed? -Joey On Wed, Oct 5, 2011 at 3:35 PM, Jessica Owensby jessica.owen...@gmail.com wrote: Hi Joey, Thanks. I forgot to say that; yes, the lzocodec class is listed in core-site.xml under the io.compression.codecs property: <property> <name>io.compression.codecs</name>

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-05 Thread Joey Echeverria
.  They are indexed using the following command: hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-20110217.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/foo/bar.lzo Jessica On Wed, Oct 5, 2011 at 3:52 PM, Joey Echeverria j...@cloudera.com wrote: Are your LZO files indexed? -Joey On Wed, Oct

Re: setInt getInt

2011-10-04 Thread Joey Echeverria
The Job class copies the Configuration that you pass in. You either need to do your conf.setInt("number", 12345) before you create the Job object, or you need to call job.getConfiguration().setInt("number", 12345). -Joey On Tue, Oct 4, 2011 at 12:28 PM, Ratner, Alan S (IS) alan.rat...@ngc.com wrote: I
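
Both options as a sketch (the property name is taken from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    conf.setInt("number", 12345);       // option 1: set before the Job copies conf
    Job job = new Job(conf);
    job.getConfiguration().setInt("number", 12345);  // option 2: set the Job's copy

    // inside a task, read it back with:
    //   int n = context.getConfiguration().getInt("number", -1);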

Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Joey Echeverria
Raj, I just tried this on my CDH3u1 VM, and the ramdisk worked the first time. So, it's possible you've hit a bug in CDH3b3 that was later fixed. Can you enable debug logging in log4j.properties and then repost your task tracker log? I think there might be more details that it will print that

Re: FileSystem closed

2011-09-29 Thread Joey Echeverria
Do you close your FileSystem instances at all? IIRC, the FileSystem instance you use is a singleton and if you close it once, it's closed for everybody. My guess is you close it in your cleanup method and you have JVM reuse turned on. -Joey On Thu, Sep 29, 2011 at 12:49 PM, Mark question

Re: Running multiple MR Job's in sequence

2011-09-29 Thread Joey Echeverria
I would definitely checkout Oozie for this use case. -Joey On Thu, Sep 29, 2011 at 12:51 PM, Aaron Baff aaron.b...@telescope.tv wrote: I saw this, but wasn't sure if it was something that ran on the client and just submitted the Job's in sequence, or if that gave it all to the JobTracker,

Re: block size

2011-09-20 Thread Joey Echeverria
HDFS blocks are stored as files in the underlying filesystem of your datanodes. Those files do not take a fixed amount of space, so if you store 10 MB in a file and you have 128 MB blocks, you still only use 10 MB (times 3 with default replication). However, the namenode does incur additional

Re: Submitting Jobs from different user to a queue in capacity scheduler

2011-09-19 Thread Joey Echeverria
FYI, I'm moving this to mapreduce-user@ and bccing common-user@. It looks like your latest permission problem is on the local disk. What is your setting for hadoop.tmp.dir? What are the permissions on that directory? -Joey On Sep 18, 2011, at 23:27, ArunKumar arunk...@gmail.com wrote: Hi

Re: Submitting Jobs from different user to a queue in capacity scheduler

2011-09-18 Thread Joey Echeverria
As hfuser, create the /user/arun directory in HDFS. Then change the ownership of /user/arun to arun. -Joey On Sep 18, 2011 8:07 AM, ArunKumar arunk...@gmail.com wrote: Hi Uma ! I have deleted the data in /app/hadoop/tmp and formatted namenode and restarted cluster.. I tried arun$

Re: Debugging mapper

2011-09-15 Thread Joey Echeverria
You might also want to look into MRUnit[1]. It lets you mock the behavior of the framework to test your map and reduce classes in isolation. It can't discover all bugs, but it's a useful tool and works nicely with IDE debuggers. -Joey [1] http://incubator.apache.org/mrunit/ On Thu, Sep 15, 2011 at 3:51

Re: Handling of small files in hadoop

2011-09-14 Thread Joey Echeverria
Hi Naveen, I use hadoop-0.21.0 distribution. I have a large number of small files (KB). Word of warning, 0.21 is not a stable release. The recommended version is in the 0.20.x range. Is there any efficient way of handling it in hadoop? I have heard that solution for that problem is using:  

Re: Hadoop doesnt use Replication Level of Namenode

2011-09-13 Thread Joey Echeverria
That won't work with the replication level as that is entirely a client side config. You can partially control it by setting the maximum replication level. -Joey On Tue, Sep 13, 2011 at 10:56 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Sep 13, 2011 at 5:53 AM, Steve Loughran

Re: Disable Sorting?

2011-09-11 Thread Joey Echeverria
The sort is what's implementing the group by key function. You can't have one without the other in Hadoop. Are you trying to disable the sort because you think it's too slow? -Joey On Sun, Sep 11, 2011 at 2:43 AM, john smith js1987.sm...@gmail.com wrote: Hi Arun, Suppose I am doing a simple

Re: Help - Rack Topology Script - Hadoop 0.20 (CDH3u1)

2011-08-21 Thread Joey Echeverria
Not that I know of. -Joey On Fri, Aug 19, 2011 at 1:16 PM, modemide modem...@gmail.com wrote: Ha, what a silly mistake. Thank you Joey. Do you also happen to know of an easier way to tell which racks the jobtracker/namenode think each node is in? On 8/19/11, Joey Echeverria j

Re: Help - Rack Topology Script - Hadoop 0.20 (CDH3u1)

2011-08-19 Thread Joey Echeverria
Did you restart the JobTracker? -Joey On Fri, Aug 19, 2011 at 12:45 PM, modemide modem...@gmail.com wrote: Hi all, I've tried to make a rack topology script.  I've written it in python and it works if I call it with the following arguments: 10.2.0.1 10.2.0.11 10.2.0.11 10.2.0.12 10.2.0.21

Re: Version Mismatch

2011-08-18 Thread Joey Echeverria
It means your HDFS client jars are using a different RPC version than your namenode and datanodes. Are you sure that XXX has $HADOOP_HOME in its classpath? It really looks like it's pointing to the wrong jars. -Joey On Thu, Aug 18, 2011 at 8:14 AM, Ratner, Alan S (IS) alan.rat...@ngc.com wrote:

Re: How do I add Hadoop dependency to a Maven project?

2011-08-16 Thread Joey Echeverria
If you're talking about the org.apache.hadoop.mapreduce.* API, that was introduced in 0.20.0. There should be no need to use the 0.21 version. -Joey On Tue, Aug 16, 2011 at 1:22 PM, W.P. McNeill bill...@gmail.com wrote: Here is my specific problem: I have a sample word count Hadoop program up

Re: WritableComparable

2011-08-14 Thread Joey Echeverria
Does your compareTo() method test object pointer equality? If so, you could be getting burned by Hadoop reusing Writable objects. -Joey On Aug 14, 2011 9:20 PM, Stan Rosenberg srosenb...@proclivitysystems.com wrote: Hi Folks, After much poking around I am still unable to determine why I am
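
A sketch of a key that compares by field value rather than object identity (the class itself is hypothetical):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class LongKey implements WritableComparable<LongKey> {
      private long id;

      public void write(DataOutput out) throws IOException { out.writeLong(id); }
      public void readFields(DataInput in) throws IOException { id = in.readLong(); }

      // compare field values, never object identity: Hadoop reuses
      // Writable instances between calls
      public int compareTo(LongKey other) {
        return id < other.id ? -1 : (id == other.id ? 0 : 1);
      }

      public boolean equals(Object o) {
        return o instanceof LongKey && ((LongKey) o).id == id;
      }

      public int hashCode() { return (int) (id ^ (id >>> 32)); }
    }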

Re: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Joey Echeverria
You can use any kind of format for files in the distributed cache, so yes you can use sequence files. They should be faster to parse than most text formats. -Joey On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki geosofie_...@yahoo.com wrote: Thank you for the reply! In each map(), I need to
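
A sketch of reading a cached SequenceFile in a mapper's setup() (assumes Text key/value types):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheMapper extends Mapper<Text, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        // cached files live on the task's local disk
        SequenceFile.Reader reader = new SequenceFile.Reader(
            FileSystem.getLocal(context.getConfiguration()),
            cached[0], context.getConfiguration());
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
          // load the side data into an in-memory structure here
        }
        reader.close();
      }
    }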

Re: Speed up node under replicated block during decomission

2011-08-12 Thread Joey Echeverria
You can configure the undocumented variable dfs.max-repl-streams to increase the number of replications a data-node is allowed to handle at one time. The default value is 2. [1] -Joey [1]

Re: Keep output folder despite a failed Job

2011-08-09 Thread Joey Echeverria
You can set the keep.failed.task.files property on the job. -Joey On Tue, Aug 9, 2011 at 9:39 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, If  i have a failure during a job, is there a way I prevent the output folder from being deleted? Cheers Saptarshi -- Joseph
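
With the new API that might look like (a sketch):

    // keep the failed attempt's working files, including its output,
    // around for post-mortem inspection
    job.getConfiguration().setBoolean("keep.failed.task.files", true);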

Re: error:Type mismatch in value from map

2011-07-29 Thread Joey Echeverria
If you want to use a combiner, your map has to output the same types as your combiner outputs. In your case, modify your map to look like this: public static class TokenizerMapper extends Mapper<Text, Text, Text, IntWritable> { public void map(Text key, Text value, Context context
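
A completed version of that mapper might look like this (a sketch assuming the job reads Text key/value pairs, e.g. via KeyValueTextInputFormat):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenizerMapper extends Mapper<Text, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          // IntWritable output matches what the combiner expects as input
          context.write(word, ONE);
        }
      }
    }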

Re: Hadoop Question

2011-07-28 Thread Joey Echeverria
How about having the slave write to a temp file first, then move it to the file the master is monitoring for once the slave has closed it? -Joey On Jul 27, 2011, at 22:51, Nitin Khandelwal nitin.khandel...@germinait.com wrote: Hi All, How can I determine if a file is being written to (by any

Re: questions regarding data storage and inputformat

2011-07-27 Thread Joey Echeverria
1. Any reason not to use a sequence file for this?  Perhaps a mapfile?  Since I've sorted it, I don't need random accesses, but I do need to be aware of the keys, as I need to be sure that I get all of the relevant keys sent to a given mapper MapFile *may* be better here (see my answer for 2

Re: questions regarding data storage and inputformat

2011-07-27 Thread Joey Echeverria
You could either use a custom RecordReader or you could override the run() method on your Mapper class to do the merging before calling the map() method. -Joey On Wed, Jul 27, 2011 at 11:09 AM, Tom Melendez t...@supertom.com wrote: 3. Another idea might be create separate seq files for chunk
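
A rough sketch of the run() approach (types are hypothetical; note that Hadoop reuses the key/value objects, so they must be copied):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MergingMapper extends Mapper<Text, Text, Text, Text> {
      @Override
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        Text currentKey = null;
        StringBuilder merged = new StringBuilder();
        while (context.nextKeyValue()) {
          Text key = context.getCurrentKey();
          if (currentKey == null || !currentKey.equals(key)) {
            if (currentKey != null) {
              // flush the previous group through map()
              map(currentKey, new Text(merged.toString()), context);
            }
            currentKey = new Text(key);  // defensive copy
            merged.setLength(0);
          }
          merged.append(context.getCurrentValue().toString());
        }
        if (currentKey != null) {
          map(currentKey, new Text(merged.toString()), context);
        }
        cleanup(context);
      }
    }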

Re: Running queries using index on HDFS

2011-07-25 Thread Joey Echeverria
To add to what Bobby said, you can get block locations with fs.getFileBlockLocations() if you want to open based on locality. -Joey On Mon, Jul 25, 2011 at 3:00 PM, Robert Evans ev...@yahoo-inc.com wrote: Sofia, You can access any HDFS file from a normal java application so long as your
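
A sketch of listing the hosts that hold each block (the path is hypothetical):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/index.dat"));
        BlockLocation[] locations =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation location : locations) {
          System.out.println(location.getOffset() + " -> "
              + Arrays.toString(location.getHosts()));
        }
      }
    }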

Re: Hadoop-streaming with a c binary executable as a mapper

2011-07-22 Thread Joey Echeverria
Your executable needs to read lines from standard in. Try setting your mapper like this: -mapper /data/yehdego/hadoop-0.20.2/pknotsRG - If that doesn't work, you may need to execute your C program from a shell script. The "-" I added to the command line says to read from STDIN. -Joey On Jul 22,

Re: Where to find best documentation for setting up kerberos authentication in 0.20.203.0rc1

2011-07-18 Thread Joey Echeverria
Hi Issac, I couldn't find anything specifically for the 0.20.203 release, but CDH3 uses basically the same security code. You could probably follow our security guide with the 0.20.203 release: https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide -Joey On Mon, Jul 18, 2011 at 12:15 PM,

Re: FW: type mismatch error

2011-07-12 Thread Joey Echeverria
Your map method is misnamed. It should be in all lower case. -Joey On Jul 12, 2011 2:46 AM, Teng, James xt...@ebay.com wrote: hi, all. I am a new hadoop beginner, I try to construct a map and reduce task to run, however encountered an exception while continue going further. Exception:

Re: HTTP Error

2011-07-08 Thread Joey Echeverria
It looks like both datanodes are trying to serve data out of the same directory. Is there any chance that both datanodes are using the same NFS mount for the dfs.data.dir? If not, what I would do is delete the data from ${dfs.data.dir} and then re-format the namenode. You'll lose all of your

Re: Cluster Tuning

2011-07-08 Thread Joey Echeverria
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. 1.0 means the maps have to completely finish before the reduce starts copying any data. I often run jobs with this set to .90-.95. -Joey On Fri, Jul 8, 2011 at 11:25 AM, Juan P. gordoslo...@gmail.com wrote: Here's another
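
In code that's one line (0.95 is just an example value):

    // don't start reducers until 95% of the maps have completed
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.95f);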

Re: Cluster Tuning

2011-07-07 Thread Joey Echeverria
Have you tried using a Combiner? Here's an example of using one: http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Example%3A+WordCount+v1.0 -Joey On Thu, Jul 7, 2011 at 4:29 PM, Juan P. gordoslo...@gmail.com wrote: Hi guys! I'd like some help fine tuning my cluster. I

Re: ArrayWritable usage

2011-07-04 Thread Joey Echeverria
ArrayWritable doesn't serialize type information. You need to subclass it (e.g. IntArrayWritable) and create a no arg constructor which calls super(IntWritable.class). Use this instead of ArrayWritable directly. If you want to store more than one type, look at the source for MapWritable to see
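
The subclass described above is tiny (a sketch):

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.IntWritable;

    public class IntArrayWritable extends ArrayWritable {
      // no-arg constructor so Hadoop can instantiate it during deserialization
      public IntArrayWritable() {
        super(IntWritable.class);
      }
    }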

Re: Does hadoop-0.20-append compatible with PIG 0.8 ?

2011-07-02 Thread Joey Echeverria
Try replacing the hadoop jar from the pig lib directory with the one from your cluster. -Joey On Jul 2, 2011, at 0:38, praveenesh kumar praveen...@gmail.com wrote: Hi guys.. I am previously using hadoop and Hbase... So for Hbase to run perfectly fine we need

Re: tar or hadoop archive

2011-06-27 Thread Joey Echeverria
Yes, you can see a picture describing HAR files in this old blog post: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ -Joey On Mon, Jun 27, 2011 at 4:36 PM, Rita rmorgan...@gmail.com wrote: So, it does an index of the file? On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria j

Re: Append to Existing File

2011-06-21 Thread Joey Echeverria
Yes. -Joey On Jun 21, 2011 1:47 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi All, Does CDH3 support Existing File Append ? Regards, Jagaran From: Eric Charles eric.char...@u-mangate.com To: common-user@hadoop.apache.org Sent: Tue, 21 June, 2011

Re: problem with streaming and libjars

2011-06-16 Thread Joey Echeverria
I would try the following: hadoop -libjars /home/ayon/jars/MultiOutput.jar jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar -libjars /home/ayon/jars/MultiOutput.jar -input /user/ayon/streaming_test_input -output /user/ayon/streaming_test_output -mapper /bin/cat

Re: Datanode not created on hadoop-0.20.203.0

2011-06-16 Thread Joey Echeverria
Message- From: Joey Echeverria [mailto:j...@cloudera.com] Sent: Wednesday, June 15, 2011 12:01 PM To: common-user@hadoop.apache.org Subject: Re: Datanode not created on hadoop-0.20.203.0 By any chance, are you running as root? If so, try running as a different user. -Joey On Wed, Jun

Re: Datanode not created on hadoop-0.20.203.0

2011-06-15 Thread Joey Echeverria
By any chance, are you running as root? If so, try running as a different user. -Joey On Wed, Jun 15, 2011 at 12:53 PM, rutesh rutesh.cha...@gmail.com wrote: Hi,   I am new to hadoop (Just 1 month old). These are the steps I followed to install and run hadoop-0.20.203.0: 1) Downloaded tar

Re: a file can be used as a queue?

2011-06-13 Thread Joey Echeverria
This feature doesn't currently work. I don't remember the JIRA for it, but there's a ticket which will allow a reader to read from an HDFS file before it's closed. In that case, you implement a queue by having the producer write to the end of the file and the reader read from the beginning of

Re: Hardware specs

2011-06-09 Thread Joey Echeverria
There are some good recommendations in this blog post: http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/ It's a little dated, but the reasoning and basics are sound. -Joey On Thu, Jun 9, 2011 at 10:59 AM, Mark static.void@gmail.com

Re: Hbase startup error: NoNode for /hbase/master after running out of space

2011-06-08 Thread Joey Echeverria
Hey Andy, You're correct that 0.20.203 doesn't have append. Your best bet is to build a version of the append branch or <shameless-plug>switch to CDH3u0</shameless-plug>. -Joey On Tue, Jun 7, 2011 at 6:31 PM, Zhong, Sheng sheng.zh...@searshc.com wrote: Thanks! The issue has been resolved by

Re: Why inter-rack communication in mapreduce slow?

2011-06-06 Thread Joey Echeverria
Larger Hadoop installations are space dense, 20-40 nodes per rack. When you get to that density with multiple racks, it becomes expensive to buy a switch with enough capacity for all of the nodes in all of the racks. The typical solution is to install a switch per rack with uplinks to a core

Re: Why inter-rack communication in mapreduce slow?

2011-06-06 Thread Joey Echeverria
Most of the network bandwidth used during a MapReduce job should come from the shuffle/sort phase. This part doesn't use HDFS. The TaskTrackers running reduce tasks will pull intermediate results from TaskTrackers running map tasks over HTTP. In most cases, it's difficult to get rack locality

Re: Heap Size question.

2011-06-01 Thread Joey Echeverria
The values show you the maximum heap size and currently used heap of the job tracker, not running jobs. Furthermore, the HADOOP_HEAPSIZE setting only sets the maximum heap for the daemons, not the tasks in your job. If you're getting OOMEs, you should add a setting to your mapred-site.xml file
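
The message is truncated, but the usual task-heap setting in that era is mapred.child.java.opts (my assumption; the 1 GB value is only an illustration, the default is -Xmx200m). It can also be set per job in code:

    // give each task JVM a 1 GB heap instead of the default 200 MB
    conf.set("mapred.child.java.opts", "-Xmx1024m");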

Re: Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread Joey Echeverria
The problem is that start-all.sh isn't all that intelligent. The way that start-all.sh works is by running start-dfs.sh and start-mapred.sh. The start-mapred.sh script always starts a job tracker on the local host and a task tracker on all of the hosts listed in slaves (it uses SSH to do the

Re: Is it safe to manually copy BLK files?

2011-05-30 Thread Joey Echeverria
The short answer is no. If you want to decommission a datanode, the safest way is to put the hostnames of the datanodes you want to shut down into a file on the namenode. Next, set the dfs.hosts.exclude parameter to point to the file. Finally, run hadoop dfsadmin -refreshNodes. As an FYI, I think you

Re: I can't see this email ... So to clarify ..

2011-05-24 Thread Joey Echeverria
Try moving the configuration to hdfs-site.xml. One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com

Re: get name of file in mapper output directory

2011-05-23 Thread Joey Echeverria
Hi Mark, FYI, I'm moving the discussion over to mapreduce-u...@hadoop.apache.org since your question is specific to MapReduce. You can derive the output name from the TaskAttemptID, which you can get by calling getTaskAttemptID() on the context passed to your cleanup() function. The task attempt
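
A sketch of deriving the default output file name in cleanup() (types are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.TaskAttemptID;

    public class NameAwareMapper extends Mapper<Text, Text, Text, Text> {
      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        TaskAttemptID attempt = context.getTaskAttemptID();
        int partition = attempt.getTaskID().getId();
        // default FileOutputFormat naming, e.g. part-m-00000 for map task 0
        String outputName = String.format("part-m-%05d", partition);
        System.err.println("this task's output file: " + outputName);
      }
    }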

Re: Log files expanding at an alarming rate

2011-05-23 Thread Joey Echeverria
Hi Karthik, FYI, I'm moving this thread to mapreduce-u...@hadoop.apache.org (You and common-user are BCCed). My guess is that your task trackers are throwing a lot of exceptions which are getting logged. Can you send a snippet of the logs to help diagnose why it's logging so much? Can you also

Re: What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread Joey Echeverria
What format is the input data in? At first glance, I would run an identity mapper and use a NullOutputFormat so you don't get any data written. The built in counters already count the number of key, value pairs read in by the mappers. -Joey On Fri, May 20, 2011 at 9:34 AM, W.P. McNeill
