December 2011 SF Hadoop User Group

2011-11-16 Thread Aaron Kimball
- Welcome - 6:30pm - Introductions; start creating agenda - Breakout sessions begin as soon as we're ready - 8pm - Conclusion Food and refreshments will be provided, courtesy of Splunk. Please RSVP at http://www.meetup.com/hadoopsf/events/41427512/ Regards, - Aaron Kimball

October SF Hadoop Meetup

2011-09-30 Thread Aaron Kimball
/events/35650052/ Regards, - Aaron Kimball

Next SF HUG: June 8, at RichRelevance

2011-05-19 Thread Aaron Kimball
ready - 8pm - Conclusion Food and refreshments will be provided, courtesy of RichRelevance. If you're going to attend, please RSVP at http://bit.ly/kxaJqa. Hope to see you all there! - Aaron Kimball

April SFHUG recap, May SFHUG meetup announcement

2011-04-18 Thread Aaron Kimball
will be provided, courtesy of Cloudera. Please RSVP at http://bit.ly/hwMCI2 Looking forward to seeing you there! Regards, - Aaron Kimball

SF Hadoop Meetup - March review and April announcement (April 13)

2011-03-11 Thread Aaron Kimball
* 8pm - Conclusion Regards, - Aaron Kimball

Re: Problem running a Hadoop program with external libraries

2011-03-04 Thread Aaron Kimball
I don't know if putting native-code .so files inside a jar works. A native-code .so is not classloaded in the same way .class files are. So the correct .so files probably need to exist in some physical directory on the worker machines. You may want to double-check that the correct directory on the

Re: Problem running a Hadoop program with external libraries

2011-03-04 Thread Aaron Kimball
errors are coming from. If they're from the OS, could it be because it needs to fork() and momentarily exceed the ulimit before loading the native libs? - Aaron On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball akimbal...@gmail.com wrote: I don't know if putting native-code .so files inside a jar

Reminder: SF Hadoop meetup in 1 week

2011-03-02 Thread Aaron Kimball
to facilitate a discussion. All members of the Hadoop community are welcome to attend. While all Hadoop-related subjects are on topic, this month's discussion theme is integration. Regards, - Aaron Kimball

March 2011 San Francisco Hadoop User Meetup (integration)

2011-02-23 Thread Aaron Kimball
has asked that all attendees RSVP in advance, to comply with their security policy. Please join the meetup group and RSVP at http://www.meetup.com/hadoopsf/events/16678757/ Refreshments will be provided. Regards, - Aaron Kimball

SF Hadoop meetup report

2011-02-11 Thread Aaron Kimball
announcement! Sign up at http://www.meetup.com/hadoopsf/ Regards, - Aaron Kimball

Re: Click Stream Data

2011-01-30 Thread Aaron Kimball
Start with the student's CS department's web server? I believe the wikimedia foundation also makes the access logs to wikipedia et al. available publicly. That is quite a lot of data though. - Aaron On Sun, Jan 30, 2011 at 10:54 AM, Bruce Williams williams.br...@gmail.com wrote: Does anyone

Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
W. P., How are you running your Reducer? Is everything running in standalone mode (all mappers/reducers in the same process as the launching application)? Or are you running this in pseudo-distributed mode or on a remote cluster? Depending on the application's configuration, log4j configuration

Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
gave excerpts from is a central one for the cluster. On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com wrote: W. P., How are you running your Reducer? Is everything running in standalone mode (all mappers/reducers in the same process as the launching application

Re: Behaviour of reducer's Iterable in MR unit.

2010-12-09 Thread Aaron Kimball
Hi James, The ReduceDriver is configured to receive a list of inputs because lists have ordering guarantees whereas other Iterables/Collections do not; for determinism's sake, it is best to guarantee that you're calling reduce() with an ordered set of values when testing. It would be stellar if

Re: Help tuning a cluster - COPY slow

2010-11-17 Thread Aaron Kimball
Tim, Are there issues with DNS caching (or lack thereof), misconfigured /etc/hosts, or other network-config gotchas that might be preventing network connections between hosts from opening efficiently? - Aaron On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson timrobertson...@gmail.com wrote:

San Francisco Hadoop meetup

2010-11-04 Thread Aaron Kimball
: * I've created a short survey to help understand days / times that would work for the most people: http://bit.ly/ajK26U * Please also join the meetup group at http://meetup.com/hadoopsf -- We'll use this to plan the event, RSVP information, etc. I'm looking forward to meeting more of you! - Aaron

Re: MRUnit Download

2010-08-20 Thread Aaron Kimball
Rakesh, If you download Hadoop from CDH, it will include MRUnit in the contrib directory. Here's a direct link: http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320.tar.gz - Aaron On Fri, Aug 20, 2010 at 11:53 AM, rakesh kothari rkothari_...@hotmail.com wrote: Hi, This link:

Re: mrunit question

2010-08-11 Thread Aaron Kimball
David, Since you are directly instantiating the Mapper and Reducer (not using ReflectionUtils), you are free to call setConf() yourself before you run the test. If you're using the old API (o.a.h.mrunit): Mapper m = new Mapper(); MapDriver d = new MapDriver(m); Configuration conf = new
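
To illustrate, here is a minimal sketch of that pattern using the contrib MRUnit MapDriver, assuming a hypothetical old-API mapper called LineFilterMapper that reads a "filter.prefix" property in its configure(JobConf) method (the class, property, and key/value types are made up for the example):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mrunit.MapDriver;

    public class LineFilterMapperTest {
      public void testFilterKeepsMatchingLines() {
        // Hypothetical old-API (o.a.h.mapred) mapper under test.
        LineFilterMapper mapper = new LineFilterMapper();

        // Because the test instantiates the mapper directly, its configuration
        // hook can simply be called by hand before the driver runs it.
        JobConf conf = new JobConf();
        conf.set("filter.prefix", "foo");
        mapper.configure(conf);

        MapDriver<LongWritable, Text, Text, LongWritable> driver =
            new MapDriver<LongWritable, Text, Text, LongWritable>(mapper);
        driver.withInput(new LongWritable(0), new Text("foobar"))
              .withOutput(new Text("foobar"), new LongWritable(1))
              .runTest();
      }
    }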

Re: INFO: Task Id : attempt_201007191410_0002_m_000000_0, Status : FAILED

2010-07-20 Thread Aaron Kimball
The most likely problem I suspect is that you're emitting a key or a value to the OutputCollector that does not inherit from o.a.h.io.Writable. Your input/output types should all do this. There are stock implementations (IntWritable, LongWritable, FloatWritable, Text -- for strings, etc.) of all
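
As a sketch of what that looks like in practice (new 0.20 API; the class name is illustrative), every type crossing the map-output boundary is a stock Writable implementation:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          word.set(token);
          // Text (for strings) and IntWritable both implement Writable, so the
          // framework can serialize them between the map and reduce phases.
          context.write(word, ONE);
        }
      }
    }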

Re: Text files vs. SequenceFiles

2010-07-05 Thread Aaron Kimball
David, I think you've more-or-less outlined the pros and cons of each format (though do see Alex's important point regarding SequenceFiles and compression). If everyone who worked with Hadoop clearly favored one or the other, we probably wouldn't include support for both formats by default. :)

Re: create error

2010-07-05 Thread Aaron Kimball
Is there a reason you're using that particular interface? That's very low-level. See http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample for the proper API to use. - Aaron On Sat, Jul 3, 2010 at 1:36 AM, Vidur Goyal vi...@students.iiit.ac.in wrote: Hi, I am trying to create a file in

Re: Partitioned Datasets Map/Reduce

2010-07-05 Thread Aaron Kimball
One possibility: write out all the partition numbers (one per line) to a single file, then use the NLineInputFormat to make each line its own map task. Then in your mapper itself, you will get in a key of 0 or 1 or 2 etc. Then explicitly open /dataset1/part-(n) and /dataset2/part-(n) in your
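
A rough sketch of the mapper side of that approach (new API; with NLineInputFormat the partition number arrives as the line contents, i.e. the Text value, and the part-file naming and output types here are assumptions for illustration):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driven by NLineInputFormat over a file listing one partition number per
    // line, so each map task processes exactly one partition of each dataset.
    public class PartitionPairMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        int partition = Integer.parseInt(line.toString().trim());
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Explicitly open the matching slice of each dataset from HDFS.
        FSDataInputStream left =
            fs.open(new Path(String.format("/dataset1/part-%05d", partition)));
        FSDataInputStream right =
            fs.open(new Path(String.format("/dataset2/part-%05d", partition)));
        try {
          // ... read both streams, join/merge records, context.write() the results ...
        } finally {
          left.close();
          right.close();
        }
      }
    }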

Re: exporting partitioned data into remote database

2010-06-18 Thread Aaron Kimball
Hi Szymon, Unfortunately Sqoop can't yet read partition column information out of the Hive table. So you'll need to export each partition individually. You can probably get your system to work right by doing these two commands: sqoop --connect jdbc:mysql://test-db.gadu/crunchers --username

Re: Compiling Hive against a different version of Hadoop 20

2010-06-14 Thread Aaron Kimball
Also try supplying this flag to ant: -Doffline=true On Mon, Jun 14, 2010 at 9:06 PM, Carl Steinbach c...@cloudera.com wrote: Hi Viraj, You can put your hadoop tarball in the cache directory that ivy uses (probably ~/.ant/cache/hadoop/core/sources). This should prevent Ivy from failing when

Re: Multithreaded Mapper and Map runner

2010-06-11 Thread Aaron Kimball
This will likely break most programs you try to run. Many mapper implementations are not thread safe. That having been said, if you want to force all programs using the old API (org.apache.hadoop.mapred.*) to run on the multithreaded maprunner, you can do this by setting mapred.map.runner.class
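
If the goal is a single job rather than a cluster-wide default, the same switch can be flipped per job in the old API; a minimal sketch (the thread-count property name is my recollection of the 0.20-era setting and should be verified):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class MultithreadedJobSetup {
      public static JobConf enableMultithreadedMaps(JobConf conf) {
        // Replace the default single-threaded call loop with a pool of threads
        // per map task; only safe if the Mapper implementation is thread-safe.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Threads per map task (assumed property name; see note above).
        conf.setInt("mapred.map.multithreadedrunner.threads", 10);
        return conf;
      }
    }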

Re: Unable to read sequence file produced by MR job

2010-05-27 Thread Aaron Kimball
Put your jar on Hadoop's classpath: $ HADOOP_CLASSPATH=path/to/shortdocwritables.jar hadoop fs -text bla bla - Aaron On Thu, May 27, 2010 at 11:07 AM, James Hammerton james.hammer...@mendeley.com wrote: Hi, I tried using the hadoop fs -text command to read a sequence file generated by

Re: help on CombineFileInputFormat

2010-05-10 Thread Aaron Kimball
Zhenyu, It's a bit complicated and involves some layers of indirection. CombineFileRecordReader is a sort of shell RecordReader that passes the actual work of reading records to another child record reader. That's the class name provided in the third parameter. Instructing it to use

Sqoop is moving to github!

2010-03-29 Thread Aaron Kimball
process, please ask me. Regards, - Aaron Kimball Cloudera, Inc.

Re: Does it possible that output value type of Map function different from final type?

2010-03-21 Thread Aaron Kimball
Glad you figured out your bug. It's worth pointing out that Google's MapReduce has stricter type signatures than Hadoop does. Hadoop allows the final (k, v) pairs to be of a different type than the inputs to the reducer, and the final key associated with a final value does not necessarily need to

Re: Sqoop Installation on Apache Hadop 0.20.2

2010-03-17 Thread Aaron Kimball
Hi Utku, Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of the DataDrivenDBInputFormat (among other APIs) which are not shipped with Apache's 0.20 release. In order to get Sqoop working on 20, you'd need to apply a lengthy list of patches from the project source repository to your

Re: CombineFileInputFormat in 0.20.2 version

2010-03-16 Thread Aaron Kimball
The most obvious workaround is to use the old API (continue to use Mapper, Reducer, etc. from org.apache.hadoop.mapred, not .mapreduce). If you really want to use the new API, though, I unfortunately don't see a super-easy path. You could try to apply the patch from MAPREDUCE-364 to your version

Re: Parallelizing HTTP calls with MapReduce

2010-03-08 Thread Aaron Kimball
I think you should actually use the Java-based MapReduce here. As has been noted, these will be network-bound calls. And if you're trying to make a lot of them, my experience is that individual calls are slow. 10,000 GET requests could each take a second or two, especially if they involve DNS
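
A minimal sketch of the Java-based approach, one URL per input line (the class name and the choice to emit HTTP status codes are illustrative):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HttpFetchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      @Override
      protected void map(LongWritable offset, Text url, Context context)
          throws IOException, InterruptedException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(url.toString().trim()).openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        int status = conn.getResponseCode();   // the slow, network-bound call
        conn.disconnect();
        context.write(url, new IntWritable(status));
        context.progress();   // keep the task from timing out across slow requests
      }
    }

Spreading the URL list across many map tasks (for example with NLineInputFormat) is what buys the parallelism here.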

Re: Separate mail list for streaming?

2010-03-03 Thread Aaron Kimball
We've already got a lot of mailing lists :) If you send questions to mapreduce-user, are you not getting enough feedback? - Aaron On Wed, Mar 3, 2010 at 12:09 PM, Michael Kintzer rockrep.had...@gmail.com wrote: Hi, Was curious if anyone else thought it would be useful to have a separate mail

Re: Unexpected termination of a job

2010-03-03 Thread Aaron Kimball
If it's terminating before you even run a job, then you're in luck -- it's all still running on the local machine. Try running it in Eclipse and use the debugger to trace its execution. - Aaron On Wed, Mar 3, 2010 at 4:13 AM, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, I am running a

Re: Why is $JAVA_HOME/lib/tools.jar in the classpath?

2010-02-17 Thread Aaron Kimball
Thomas, What version of Hadoop are you building Debian packages for? If you're taking Cloudera's existing debs and modifying them, these include a backport of Sqoop (from Apache's trunk) which uses the rt tools.jar to compile auto-generated code at runtime. Later versions of Sqoop (including the

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-12 Thread Aaron Kimball
Sonal, Can I ask why you're sleeping between starting hdfs and mapreduce? I've never needed this in my own code. In general, Hadoop is pretty tolerant about starting daemons out of order. If you need to wait for HDFS to be ready and come out of safe mode before launching a job, that's another

Re: mapred.system.dir

2010-02-12 Thread Aaron Kimball
To expand on Eric's comment: dfs.data.dir is the local filesystem directory (or directories) that a particular datanode uses to store its slice of the HDFS data blocks. so dfs.data.dir might be /home/hadoop/data/ on some machine; a bunch of files with inscrutable names like

Re: Any alternative for MultipleOutputs class in hadoop 0.18.3

2010-02-11 Thread Aaron Kimball
There's an older mechanism called MultipleOutputFormat which may do what you need. - Aaron On Fri, Feb 5, 2010 at 10:13 AM, Udaya Lakshmi udaya...@gmail.com wrote: Hi, MultipleOutput class is not available in hadoop 0.18.3. Is there any alternative for this class? Please point me useful
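
A sketch of that older mechanism, assuming the goal is one output file per key (old o.a.h.mapred API; the class name is illustrative):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Routes each record to an output file named after its key,
    // e.g. <output dir>/foo-part-00000 for records with key "foo".
    public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String leafName) {
        return key.toString() + "-" + leafName;
      }
    }

Register it with conf.setOutputFormat(KeyNamedOutputFormat.class) in the job's JobConf.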

Re: hadoop under cygwin issue

2010-01-31 Thread Aaron Kimball
Brian, it looks like you missed a step in the instructions. You'll need to format the hdfs filesystem instance before starting the NameNode server: You need to run: $ bin/hadoop namenode -format .. then you can do bin/start-dfs.sh Hope this helps, - Aaron On Sat, Jan 30, 2010 at 12:27 AM,

Re: DBOutputFormat Speed Issues

2010-01-31 Thread Aaron Kimball
Nick, I'm afraid that right now the only available OutputFormat for JDBC is that one. You'll note that DBOutputFormat doesn't really include much support for special-casing to MySQL or other targets. Your best bet is to probably copy the code from DBOutputFormat and DBConfiguration into some

Re: map side only behavior

2010-01-31 Thread Aaron Kimball
In a map-only job, map tasks will be connected directly to the OutputFormat. So calling output.collect() / context.write() in the mapper will emit data straight to files in HDFS without sorting the data. There is no sort buffer involved. If you want exactly one output file, follow Nick's advice.
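
The job-level switch that produces this behavior, wrapped in a trivial helper for illustration (new API):

    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyJobs {
      // Zero reduce tasks: each mapper writes through the OutputFormat directly
      // to HDFS (one part file per map task), with no sort or shuffle stage.
      public static void makeMapOnly(Job job) {
        job.setNumReduceTasks(0);
      }
    }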

Re: Multiple file output

2010-01-07 Thread Aaron Kimball
Note that org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is scheduled for the next CDH 0.20 release -- ready soon. - Aaron 2010/1/6 Amareshwari Sri Ramadasu amar...@yahoo-inc.com No. It is part of branch 0.21 onwards. For 0.20*, people can use old api only, though JobConf is

Re: Re: Doubt in Hadoop

2009-11-27 Thread Aaron Kimball
When you set up the Job object, do you call job.setJarByClass(Map.class)? That will tell Hadoop which jar file to ship with the job and to use for classloading in your code. - Aaron On Thu, Nov 26, 2009 at 11:56 PM, aa...@buffalo.edu wrote: Hi, I am running the job from command line. The
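
A sketch of a driver that sets the jar explicitly (new 0.20 API; MyMapper and MyReducer are hypothetical placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "my job");
        // Ships the jar containing this class with the job and uses it for
        // classloading in the task JVMs; omitting it is a common cause of
        // ClassNotFoundException in mappers and reducers.
        job.setJarByClass(MyJobDriver.class);
        job.setMapperClass(MyMapper.class);      // hypothetical mapper
        job.setReducerClass(MyReducer.class);    // hypothetical reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }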

Re: part-00000.deflate as output

2009-11-27 Thread Aaron Kimball
You are always free to run with compression disabled. But in many production situations, space or performance concerns dictate that all data sets are stored compressed, so I think Tim was assuming that you might be operating in such an environment -- in which case, you'd only need things to appear

Re: Hadoop Job Performance as MapReduce count increases?

2009-11-11 Thread Aaron Kimball
, and the subsequent table make relevant sense to you, and does it reflect the true nature of potential performance issues with very small MapReduce task count on an unreliable small cluster? thanks, Rob Stewart 2009/11/9 Aaron Kimball aa...@cloudera.com On Sat, Nov 7, 2009 at 11:01 AM, Rob

Re: error setting up hdfs?

2009-11-10 Thread Aaron Kimball
You don't need to specify a path. If you don't specify a path argument for ls, then it uses your home directory in HDFS (/user/yourusernamehere). When you first started HDFS, /user/hadoop didn't exist, so 'hadoop fs -ls' (equivalent to 'hadoop fs -ls /user/hadoop') reported directory not found. When you mkdir'd

Re: Hadoop Job Performance as MapReduce count increases?

2009-11-08 Thread Aaron Kimball
On Sat, Nov 7, 2009 at 11:01 AM, Rob Stewart robstewar...@googlemail.com wrote: Hi, briefly, I'm writing my dissertation on Distributed computation, and then in detail at the various interfaces atop of Hadoop, including Pig, Hive, JAQL etc... One thing I have noticed in early testing is that

Re: How to build and deploy Hadoop 0.21 ?

2009-11-08 Thread Aaron Kimball
On Thu, Nov 5, 2009 at 2:34 AM, Andrei Dragomir adrag...@adobe.com wrote: Hello everyone. We ran into a bunch of issues with building and deploying hadoop 0.21. It would be great to get some answers about how things should work, so we can try to fix them. 1. When checking out the

Re: Can I install two different version of hadoop in the same cluster ?

2009-10-30 Thread Aaron Kimball
Also hadoop.tmp.dir and mapred.local.dir in your xml configuration, and the environment variables HADOOP_LOG_DIR and HADOOP_PID_DIR in hadoop-env.sh. - Aaron On Thu, Oct 29, 2009 at 10:44 PM, Jeff Zhang zjf...@gmail.com wrote: Hi all, I have installed hadoop 0.18.3 on my own cluster with 5

Re: Which FileInputFormat to use for fixed length records?

2009-10-28 Thread Aaron Kimball
and a corresponding FixedLengthRecordReader. Would the Hadoop commons project have interest in these? Basically these are for reading inputs of textual record data, where each record is a fixed length, (no carriage returns or separators etc) thanks On Oct 20, 2009, at 11:00 PM, Aaron Kimball

Using LinuxTaskController -- Local Dir Permissions Error

2009-10-23 Thread Aaron Kimball
Hi all, I'm trying to get the LinuxTaskController working (on the svn trunk) on a pseudo-distributed cluster. It's being quite frustrating. I compiled common, hdfs, and mapred jars with 'ant jar' and copied everything together into the same directory structure. I then ran: $ cd

Re: Can I have multiple reducers?

2009-10-22 Thread Aaron Kimball
If you need another shuffle after your first reduce pass, then you need a second MapReduce job to run after the first one. Just use an IdentityMapper. This is a reasonably common situation. - Aaron On Thu, Oct 22, 2009 at 4:17 PM, Forhadoop rutu...@gmail.com wrote: Hello, In my application

Re: Which FileInputFormat to use for fixed length records?

2009-10-20 Thread Aaron Kimball
You'll need to write your own, I'm afraid. You should subclass FileInputFormat and go from there. You may want to look at TextInputFormat / LineRecordReader for an example of how an IF/RR gets put together, but there isn't an existing fixed-len record reader. - Aaron On Tue, Oct 20, 2009 at

Re: How can I run such a mapreduce program?

2009-10-17 Thread Aaron Kimball
If you're working with the Cloudera distribution, you can install CDH1 (0.18.3) and CDH2 (0.20.1) side-by-side on your development machine. They'll install to /usr/lib/hadoop-0.18 and /usr/lib/hadoop-0.20; use /usr/bin/hadoop-0.18 and /usr/bin/hadoop-0.20 to execute, etc. See

Re: Error in FileSystem.get()

2009-10-15 Thread Aaron Kimball
Bhupesh: If you use FileSystem.newInstance(), does that return the correct object type? This sidesteps CACHE. - A On Thu, Oct 15, 2009 at 3:07 PM, Bhupesh Bansal bban...@linkedin.com wrote: This code is not map/reduce code and run only on single machine and Also each node prints the right value

Re: Assign different task to the workers.

2009-10-06 Thread Aaron Kimball
The same mapper class will be used for all map tasks. Individual tasks could perhaps do something clever like recover their task ids and do something different based on task id modulo the number of distinct task types, but in general, the idea behind MapReduce is that all tasks should be applying

Re: FileSystem Caching in Hadoop

2009-10-06 Thread Aaron Kimball
Edward, Interesting concept. I imagine that implementing CachedInputFormat over something like memcached would make for the most straightforward implementation. You could store 64MB chunks in memcached and try to retrieve them from there, falling back to the filesystem on failure. One obvious

Re: Locality when placing Map tasks

2009-10-06 Thread Aaron Kimball
Map tasks are generated based on InputSplits. An InputSplit is a logical description of the work that a task should use. The array of InputSplit objects is created on the client by the InputFormat. org.apache.hadoop.mapreduce.InputSplit has an abstract method: /** * Get the list of nodes by
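
For reference, the two abstract methods on org.apache.hadoop.mapreduce.InputSplit that drive this (signatures as in the 0.20-era API):

    // Size of the split in bytes; used, among other things, to schedule
    // larger splits first.
    public abstract long getLength() throws IOException, InterruptedException;

    // Hostnames of nodes where the split's data is stored; the scheduler uses
    // these as locality hints when assigning the corresponding map task.
    public abstract String[] getLocations() throws IOException, InterruptedException;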

Re: how to handle large volume reduce input value in mapreduce program?

2009-10-05 Thread Aaron Kimball
HB, From a scalability perspective, you will always have a finite limit of RAM available. Without knowing much about HBase, I can't tell whether this is the only way you can accomplish your goal or not. But a basic maxim to follow is, Using O(N) space in your reducer is guaranteed to overflow for

Re: Is it OK to run with no secondary namenode?

2009-10-05 Thread Aaron Kimball
Quite possible. :\ - A On Thu, Oct 1, 2009 at 5:17 PM, Mayuran Yogarajah mayuran.yogara...@casalemedia.com wrote: Aaron Kimball wrote: If you want to run the 2NN on a different node than the NN, then you need to set dfs.http.address on the 2NN to point to the namenode's http server

Re: Is it OK to run with no secondary namenode?

2009-10-01 Thread Aaron Kimball
If you want to run the 2NN on a different node than the NN, then you need to set dfs.http.address on the 2NN to point to the namenode's http server address. See http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/ - Aaron On Mon, Sep 28, 2009 at 2:17 PM, Todd

Re: RandomAccessFile with HDFS

2009-09-22 Thread Aaron Kimball
Or maybe more pessimistically, the second stable append implementation. It's not like HADOOP-1700 wasn't intended to work. It was just found not to after the fact. Hopefully this reimplementation will succeed. If you're running a cluster that contains mission-critical data that cannot tolerate

Re: SequenceFileAsBinaryOutputFormat for M/R

2009-09-22 Thread Aaron Kimball
In the 0.20 branch, the common best-practice is to use the old API and ignore deprecation warnings. When you get to 0.22, you'll need to convert all your code to use the new API. There may be a new-API equivalent in org.apache.hadoop.mapreduce.lib.output that you could use, if you convert your

Re: Prepare input data for Hadoop

2009-09-22 Thread Aaron Kimball
Use an external database (e.g., mysql) or some other transactional bookkeeping system to record the state of all your datasets (STAGING, UPLOADED, PROCESSED) - Aaron On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan dac...@gmail.com wrote: Hi all, I have a question about strategy to prepare data

Re: copy data (distcp) from local cluster to the EC2 cluster

2009-09-10 Thread Aaron Kimball
That's 99% correct. If you want/need to run different versions of HDFS on the two different clusters, then you can't use hdfs:// protocol to access both of them in the same command. In this case, use hdfs://bla/ for the source fs and hftp://bla2/ for the dest fs. - Aaron On Tue, Sep 8, 2009 at

Re: Testing Hadoop job

2009-08-28 Thread Aaron Kimball
Hi Nikhil, MRUnit now supports the 0.20 API as of https://issues.apache.org/jira/browse/MAPREDUCE-800. There are no plans to involve partitioners in MRUnit; it is for mappers and reducers only, and not for full jobs involving input/output formats, partitioners, etc. Use the LocalJobRunner for

Re: Help.

2009-08-25 Thread Aaron Kimball
Are you trying to serve blocks from a shared directory e.g. NFS? The storageID for a node is recorded in a file named VERSION in ${dfs.data.dir}/current. If one node claims that the storage directory is already locked, and another node is reporting the first node's storageID, it makes me think

Re: Writing to a db with DBOutputFormat spits out IOException Error

2009-08-24 Thread Aaron Kimball
As a more general note -- any jars needed by your mappers and reducers either need to be in your job jar in the lib/ directory of the .jar file, or in $HADOOP_HOME/lib/ on all tasktracker nodes where mappers and reducers get run. - Aaron On Fri, Aug 21, 2009 at 10:47 AM, ishwar ramani

Re: Hadoop streaming: How is data distributed from mappers to reducers?

2009-08-24 Thread Aaron Kimball
Yes. It works just like Java-based MapReduce in that regard. - Aaron On Sun, Aug 23, 2009 at 5:09 AM, Nipun Saggar nipun.sag...@gmail.com wrote: Hi all, I have recently started using Hadoop streaming. From the documentation, I understand that by default, each line output from a mapper up to

Re: How to speed up the copy phrase?

2009-08-24 Thread Aaron Kimball
If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe 40 if you want it to run in two waves. (Assuming 1 core/node. Multiply by N for N cores...) As it is, each node has 500-ish map tasks that it has to read from and for each of these, it needs to generate 500 separate reduce

Re: localhost:9000 and ip:9000 is not the same ?

2009-08-24 Thread Aaron Kimball
Jeff, Hadoop (HDFS in particular) is overly strict about machine names. The filesystem's id is based on the DNS name used to access it. This needs to be consistent across all nodes and all configurations in your cluster. You should always use the fully-qualified domain name of the namenode in

Re: Using Hadoop with executables and binary data

2009-08-20 Thread Aaron Kimball
Look into typed bytes: http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/ On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake jnekanay...@gmail.com wrote: Hi Stefan, I am sorry, for the late reply. Somehow the response email has slipped my eyes. Could you explain a bit on how to

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Aaron Kimball
Compressed OOPs are available now in 1.6.0u14: https://jdk6.dev.java.net/6uNea.html - Aaron On Thu, Aug 20, 2009 at 10:51 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Suresh had made a spreadsheet for memory consumption.. will check. A large portion of NN memory is taken by references. I

Re: Location of the source code for the fair scheduler

2009-08-19 Thread Aaron Kimball
Hi Mithila, In the Mapreduce svn tree, it's under src/contrib/fairscheduler/ - Aaron On Wed, Aug 19, 2009 at 2:48 PM, Mithila Nagendra mnage...@asu.edu wrote: Hello I was wondering how I could locate the source code files for the fair scheduler. Thanks Mithila

Re: How does hadoop deal with hadoop-site.xml?

2009-08-19 Thread Aaron Kimball
Hi Inifok, This is a confusing aspect of Hadoop, I'm afraid. Settings are divided into two categories: per-job and per-node. Unfortunately, which are which, isn't documented. Some settings are applied to the node that is being used. So for example, if you set fs.default.name on a node to be

Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread Aaron Kimball
Also, if you haven't yet configured rack awareness, now's a good time to start :) - Aaron On Tue, Aug 11, 2009 at 11:27 PM, Ted Dunning ted.dunn...@gmail.com wrote: If you add these nodes, data will be put on them as you add data to the cluster. Soon after adding the nodes you should

Re: ~ Replacement for MapReduceBase ~

2009-08-10 Thread Aaron Kimball
Naga, That's right. In the old API, Mapper and Reducer were just interfaces and didn't provide default implementations of their code. Thus MapReduceBase. Now Mapper and Reducer are classes to extend, so no MapReduceBase is needed. - Aaron On Fri, Aug 7, 2009 at 8:26 AM, Naga Vijayapuram

Re: newbie question

2009-08-10 Thread Aaron Kimball
You can set it on a per-file basis if you'd like the control. The data structures associated with files allow these to be individually controlled. But there's also a create() call that only accepts the Path to open as an argument. This uses the configuration file defaults. This use case is

Re: Help in running hadoop from eclipse

2009-08-06 Thread Aaron Kimball
May I ask why you're trying to run the NameNode in Eclipse? This is likely going to cause you lots of classpath headaches. I think your current problem is that it can't find its config files, so it's not able to read in the strings for what addresses it should listen on. If you want to see what's

Re: Maps running - how to increase?

2009-08-06 Thread Aaron Kimball
Is that setting in the hadoop-site.xml file on every node? Each tasktracker reads in that file once and sets its max map tasks from that. There's no way to control this setting on a per-job basis or from the client (submitting) system. If you've changed hadoop-site.xml after starting the

Re: Maps running - how to increase?

2009-08-06 Thread Aaron Kimball
I don't know that your load-in speed is going to dramatically increase. There's a number of parameters that adjust aspects of MapReduce, but HDFS more or less works out of the box. You should run some monitoring on your nodes (ganglia, nagios) or check out what they're doing with top, iotop and

Re: how to get out of Safe Mode?

2009-08-05 Thread Aaron Kimball
For future reference, $ bin/hadoop dfsadmin -safemode leave will also just cause HDFS to exit safemode forcibly. - Aaron On Wed, Aug 5, 2009 at 1:04 AM, Amandeep Khurana ama...@gmail.com wrote: Two alternatives: 1. Do bin/hadoop namenode -format. That'll format the metadata and you can

Re: how to dump data from a mysql cluster to hdfs?

2009-08-05 Thread Aaron Kimball
mysqldump to local files on all 50 nodes, scp them to datanodes, and then bin/hadoop fs -put? - Aaron On Mon, Aug 3, 2009 at 8:15 PM, Min Zhou coderp...@gmail.com wrote: hi all, We need to dump data from a mysql cluster with about 50 nodes to a hdfs file. Considered about the issues on

Re: namenode -upgrade problem

2009-08-05 Thread Aaron Kimball
Are you sure you stopped all the daemons? Use 'sudo jps' to make sure :) - Aaron On Mon, Aug 3, 2009 at 7:26 PM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: Todd thanks for replying .. I stopped the cluster and issued the command bin/hadoop namenode -upgrade and iam

Re: namenode -upgrade problem

2009-08-05 Thread Aaron Kimball
...@gmail.com wrote: yes .. I have stopped all the daemons ... when i use jps ...i get only ... pid Jps Actually .. i upgraded the version from 18.2 to 19.x on the same path of hdfs .. is it a problem? On Wed, Aug 5, 2009 at 11:02 PM, Aaron Kimball aa...@cloudera.com wrote: Are you sure you

Re: Remote access to cluster using user as hadoop

2009-07-23 Thread Aaron Kimball
The current best practice is to firewall off your cluster, configure a SOCKS proxy/gateway, and only allow traffic to the cluster from the gateway. Being able to SSH into the gateway provides authentication. See http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/

Re: Testing Mappers in Hadoop 0.20.0

2009-07-23 Thread Aaron Kimball
I hacked around this in MRUnit last night. MRUnit now has support for the new API -- See MAPREDUCE-800. You can, in fact, subclass Mapper.Context and Reducer.Context, since they don't actually share any state with the outer class Mapper/Reducer implementation, just the type signatures. But doing

Re: Question on setting a new user variable with JobConf

2009-07-21 Thread Aaron Kimball
And regarding your desire to set things on the command line: If your program implements Tool and is launched via ToolRunner, you can specify -D myparam=myvalue on the command line and it'll automatically put that binding in the JobConf created for the tool, retrieved via getConf(). - Aaron On
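
A sketch of the Tool/ToolRunner pattern being described (the class and property names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyTool extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // Anything passed as -D key=value on the command line is already
        // present in the Configuration returned by getConf().
        System.out.println("myparam = " + getConf().get("myparam"));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, -libjars, ...)
        // out of args and applies them to the Configuration before run().
        System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
      }
    }

Invoked as: hadoop jar mytool.jar MyTool -D myparam=myvalue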

Re: Hadoop logging

2009-07-21 Thread Aaron Kimball
Hm. What version of Hadoop are you running? Have you modified the log4j.properties file in other ways? The logfiles generated by Hadoop should, by default, switch to a new file every day, appending the previous day's date to the closed log file (e.g.,

Re: datanode auto down

2009-07-21 Thread Aaron Kimball
Type State /hadoop/name IMAGE_AND_EDITS Active 2009/7/21 Aaron Kimball aa...@cloudera.com A VersionMismatch occurs because you're using different builds of Hadoop on your different nodes. All DataNodes and the NameNode must be running the exact same compilation of Hadoop (It's very

Re: Loading Data into HDFS from non-cluster node

2009-07-16 Thread Aaron Kimball
Hi Asif, Just install the Hadoop package onto the external node to use it as a client. On that node, you set your fs.default.name parameter to point to the cluster, but you don't start any daemons locally, nor do you add that node to the slaves file. Then just do hadoop fs -put localfile

Re: Error reading task output

2009-07-13 Thread Aaron Kimball
Hmm. DEBUG entries are just debug-level detail. The ERROR and WARN level entries are the problematic ones. What's your hadoop-site.xml file look like? If you're storing data underneath ${hadoop.tmp.dir} and that's set to /tmp/${user.name} (as is the default), then it's possible that a tmpwatch or

Re: hadoop cluster startup error

2009-07-13 Thread Aaron Kimball
I'm not convinced anything is wrong with the TaskTracker. Can you run jobs? Does the pi example work? If not, what error does it give? If you're trying to configure your SecondaryNameNode on a different host than your NameNode, you'll need to do some configuration tweaking. I wrote a blog post

Re: Looking for counterpart of Configure Method

2009-07-13 Thread Aaron Kimball
void close(). - Aaron On Mon, Jul 13, 2009 at 11:31 AM, akhil1988 akhilan...@gmail.com wrote: Hi All, Just like method configure in Mapper interface, I am looking for its counterpart that will perform the closing operation for a Map task. For example, in method configure I start an

Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Aaron Kimball
Reduce tasks which require more than twenty minutes are not a problem. But you must emit some data periodically to inform the rest of the system that each reducer is still alive. Emitting a (k, v) output pair to the collector will reset the timer. Similarly, calling Reporter.incrCounter() will
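
A sketch of a long-running reducer that reports progress as it works (old o.a.h.mapred API; class and counter names are illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowAggregateReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> values,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();   // stand-in for the slow per-value work
          // Any of these resets the task timeout: emitting output, bumping a
          // counter, or an explicit progress() call.
          reporter.incrCounter("SlowAggregateReducer", "values-processed", 1);
          reporter.progress();
        }
        output.collect(key, new LongWritable(sum));
      }
    }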

Re: Custom input help/debug help

2009-07-09 Thread Aaron Kimball
Hi Matthew, You can set the heap size for child jobs by calling conf.set("mapred.child.java.opts", "-Xmx1024m") to get a gig of heap space. That should fix the OOM issue in IsolationRunner. You can also change the heap size used in Eclipse; if you go to Debug Configurations, create a new

Re: Forcing Many Map Nodes

2009-07-08 Thread Aaron Kimball
If you look into FileInputFormat, you'll see that there's a call to FileSystem.getFileBlockLocations() (line 222) which finds the addresses of the nodes holding the blocks to be mapped. Each FileSplit generated in that same getSplits() method contains the list of locations where this split should

Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Aaron Kimball
Did you run the dfs put commands from the master node? If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. For more balanced loading, you should use an off-cluster machine as the point of origin. If you
