how to preserve original line order?

2009-03-12 Thread Roldano Cattoni
The task should be simple: I want to put in uppercase all the words of a (large) file. I tried the following: streaming mode; the mapper is a Perl script that puts each line in uppercase (number of mappers > 1); no reducer (number of reducers set to zero). It works fine except for line order
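Since each streaming map task handles its own split and there is no reduce phase, nothing puts the output back into file order. One common approach, sketched here in the Java API rather than the original Perl/streaming setup, is to keep the input byte offset as the key and run a single reducer so the framework's sort restores the original order:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: TextInputFormat hands the mapper the byte offset of each line;
    // emitting it as the key lets the sort phase rebuild the file order.
    public class UppercaseMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<LongWritable, Text> out, Reporter reporter)
          throws IOException {
        out.collect(offset, new Text(line.toString().toUpperCase()));
      }
    }
    // Driver side (fragment): conf.setNumReduceTasks(1) plus the default
    // IdentityReducer yields one output file sorted by offset, i.e. original order.

The trade-off is a single reducer; the leading offsets can be stripped afterwards if a plain text file is needed.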

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-03-12 Thread Sriram Rao
Hey TCK, We operate a large cluster in which we run both HDFS and KFS on the same nodes. We run two instances of KFS and one instance of HDFS in the cluster: - Our logs are in KFS and we have KFS set up in WORM mode (a mode in which deletions/renames on files/dirs are permitte

Re: Creating Lucene index in Hadoop

2009-03-12 Thread 王红宝
You can see the Nutch code. 2009/3/13 Mark Kerzner > Hi, > > How do I allow multiple nodes to write to the same index file in HDFS? > > Thank you, > Mark >

Re: tuning performance

2009-03-12 Thread Allen Wittenauer
On 3/12/09 7:13 PM, "Vadim Zaliva" wrote: > The machines have 4 disks each, striped. > However I do not see disks being a bottleneck. When you stripe, you automatically make every disk in the system run at the speed of the slowest disk. In our experience, systems are more likely to ha
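For contrast, a minimal sketch of the JBOD-style layout that avoids the slowest-disk penalty: one directory per physical disk, listed comma-separated. The mount points below are assumptions, and on a real cluster these properties live in hadoop-site.xml on every node rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class JbodDirs {
      public static void main(String[] args) {
        // Assumed mount points /disk1../disk4; HDFS and MapReduce spread I/O
        // across the listed directories instead of relying on RAID-0 striping.
        Configuration conf = new Configuration();
        conf.set("dfs.data.dir",
            "/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data,/disk4/dfs/data");
        conf.set("mapred.local.dir",
            "/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local,/disk4/mapred/local");
        System.out.println(conf.get("dfs.data.dir"));
      }
    }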

Creating Lucene index in Hadoop

2009-03-12 Thread Mark Kerzner
Hi, How do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark

Child Nodes processing jobs?

2009-03-12 Thread Richa Khandelwal
Hi, I am running a cluster of map/reduce jobs. How do I confirm that the slaves are actually executing the map/reduce tasks spawned by the JobTracker at the master? All the slaves are running their datanodes and tasktrackers fine. Thanks, Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-

Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Amareshwari Sriramadasu
Are you seeing reducers getting spawned from the web UI? Then it is a bug. If not, there won't be reducers spawned; it could be the job-setup/job-cleanup task that is running on a reduce slot. See HADOOP-3150 and HADOOP-4261. -Amareshwari Chris K Wensel wrote: May have found the answer, waiting on

Re: tuning performance

2009-03-12 Thread jason hadoop
For a simple test, set the replication on your entire cluster to 6: hadoop dfs -setrep -R -w 6 / This will triple your disk usage and probably take a while, but then you are guaranteed that all data is local. You can also get a rough idea from the Job Counters, 'Data-local map tasks' total field

Re: tuning performance

2009-03-12 Thread Vadim Zaliva
The machines have 4 disks each, striped. However I do not see disks being a bottleneck. Monitoring system activity shows that CPU is utilized 2-70%, disk usage is moderate, while network activity seems to be quite high. In this particular cluster we have 6 machines and the replication factor is 2. I wa

Hadoop Streaming throw an exception with wget as the mapper

2009-03-12 Thread Nick Cen
Hi All, I am trying to use Hadoop streaming with "wget" to simulate a distributed downloader. The command line I use is ./bin/hadoop jar -D mapred.reduce.tasks=0 contrib/streaming/hadoop-0.19.0-streaming.jar -input urli -output urlo -mapper /usr/bin/wget -outputformat org.apache.hadoop.mapred

Re: How to let key sorted in the final outputfile

2009-03-12 Thread Edward J. Yoon
For your information - http://wiki.apache.org/hama/MatMult On Wed, Nov 12, 2008 at 2:05 AM, He Chen wrote: > hi everyone > > I use hadoop to do matrix multiplication. I let the key store the row information, and let the value be the whole row, like this: > > 0 (this is the key)              (
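One detail worth calling out (an aside, not from the original thread): if the row index is emitted as Text, the shuffle sorts it lexicographically ("10" before "2"). A minimal sketch of the usual fix, using a numeric key type and a single reducer so the lone output file comes out globally ordered:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class SortedRowsJobSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setOutputKeyClass(IntWritable.class);  // numeric sort of row indices
        conf.setOutputValueClass(Text.class);       // the row itself
        conf.setNumReduceTasks(1);                  // one reducer => one globally sorted file
      }
    }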

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-03-12 Thread Raghu Angadi
TCK wrote: How well does the read throughput from HDFS scale with the number of data nodes ? For example, if I had a large file (say 10GB) on a 10 data node cluster, would the time taken to read this whole file in parallel (ie, with multiple reader client processes requesting different parts of

Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Chris K Wensel
May have found the answer, waiting on confirmation from users. It turns out 0.19.0 and 0.19.1 instantiate the reducer class when the task is actually intended for job/task cleanup. branch-0.19 looks like it resolves this issue by not instantiating the reducer class in this case. I've got a work

Hadoop User Group Meeting (Bay Area) 3/18

2009-03-12 Thread Ajay Anand
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday, March 18th at Yahoo! 2811 Mission College Blvd, Santa Clara, Building 2, Training Rooms 5 & 6 from 6:00-7:30 pm. Agenda: "Performance Enhancement Techniques with Hadoop - a Case Study" - Milind Bhandarkar "RPMs for Hadoop D

Re: Building Release 0.19.1

2009-03-12 Thread Tsz Wo (Nicholas), Sze
Hi Aviad, You are right. The eclipse plugin cannot be compiled on Windows. See also HADOOP-4310, https://issues.apache.org/jira/browse/HADOOP-4310 Nicholas Sze - Original Message > From: Aviad sela > To: Hadoop Users Support > Sent: Thursday, March 12, 2009 1:00:12 PM > Subje

Re: tuning performance

2009-03-12 Thread Aaron Kimball
Xeon vs. Opteron is likely not going to be a major factor. More important than this is the number of disks you have per machine. Task performance is proportional to both the number of CPUs and the number of disks. You are probably using way too many tasks. Adding more tasks/node isn't necessarily
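A sketch of the per-node knobs being discussed; the numbers are assumptions, not a recommendation, and in practice these settings go into hadoop-site.xml on each slave rather than into job code:

    import org.apache.hadoop.mapred.JobConf;

    public class SlotSettingsSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Rough rule of thumb: concurrent map tasks ~ number of cores/disks per node.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
      }
    }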

Building Release 0.19.1

2009-03-12 Thread Aviad sela
Building the Eclipse project on Windows XP using Eclipse 3.4 results in the following error. It seems that some of the jars needed to build the project are missing: compile: [echo] contrib: eclipse-plugin [javac] Compiling 45 source files to D:\Work\AviadWork\workspace\cur\W_ECLIPSE\E34_Hadoop_

Re: about block size

2009-03-12 Thread Doug Cutting
One factor is that block size should minimize the impact of disk seeks. For example, if a disk seeks in 10ms and transfers at 100MB/s, then a good block size will be substantially larger than 1MB. With 100MB blocks, seeks would only slow things by 1%. Another factor is that, unless files are
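Making the 1% figure explicit with the numbers given (10 ms seek, 100 MB/s transfer, 100 MB blocks):

    \text{seek overhead} = \frac{t_{\text{seek}}}{t_{\text{seek}} + B / r}
      = \frac{10\ \text{ms}}{10\ \text{ms} + 100\ \text{MB} / (100\ \text{MB/s})}
      = \frac{10}{1010} \approx 1\%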

Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Chris K Wensel
Hey all, I have some users reporting intermittent spawning of reducers when the job.xml shows mapred.reduce.tasks=0, in 0.19.0 and 0.19.1. This is also confirmed when the jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the d
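For reference, a minimal sketch of the map-only setup being described, in the 0.19 JobConf API (the class name is a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyJobSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();   // placeholder driver
        conf.setNumReduceTasks(0);      // same effect as mapred.reduce.tasks=0
        // With zero reduces, map output goes straight to the output format;
        // anything occupying a reduce slot should only be the job setup/cleanup
        // tasks discussed elsewhere in this thread (HADOOP-3150, HADOOP-4261).
      }
    }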

How to limit concurrent task numbers of a job.

2009-03-12 Thread Zhou, Yunqing
Here I have a job that contains 2000 map tasks, and each map needs 1 hour or so (the maps cannot be split because the input is a compressed archive). How can I set this job's maximum number of concurrent tasks (map and reduce) to leave resources for other urgent jobs? Thanks.

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-12 Thread Richa Khandelwal
I am running the same test, and the job that completes in 10 mins for the (hk,lv) case is still running after 30 mins have passed for the (sk,hv) case. It would be interesting to pinpoint the reason behind it. On Wed, Mar 11, 2009 at 1:27 PM, Gyanit wrote: > > Here are exact numbers: > # of (k,v) pairs = 1

Re: using virtual slave machines

2009-03-12 Thread Steve Loughran
Karthikeyan V wrote: There is no specific procedure for configuring virtual machine slaves; just make sure the following things are done. I've used these as the beginning of a wiki page on this: http://wiki.apache.org/hadoop/VirtualCluster

Re: Extending ClusterMapReduceTestCase

2009-03-12 Thread Steve Loughran
jason hadoop wrote: I am having trouble reproducing this one. It happened in a very specific environment that pulled in an alternate sax parser. The bottom line is that jetty expects a parser with particular capabilities and if it doesn't get one, odd things happen. In a day or so I will have h

Re: Persistent HDFS On EC2

2009-03-12 Thread Steve Loughran
Kris Jirapinyo wrote: Why would you lose the locality of storage-per-machine if one EBS volume is mounted to each machine instance? When that machine goes down, you can just restart the instance and re-mount the exact same volume. I've tried this idea before successfully on a 10 node cluster on

Re: HADOOP Oracle connection workaround

2009-03-12 Thread Mridul Muralidharan
It would be better to externalize this through either a template or, at the least, message bundles. - Mridul evana wrote: The out-of-the-box Hadoop implementation has some issues connecting to Oracle. Looks like DBInputFormat is built keeping MySQL/HSQLDB in mind. You need to modify the out of th

How to skip bad records in .19.1

2009-03-12 Thread 柳松
Dear all: I have set the value "SkipBadRecords.setMapperMaxSkipRecords(conf, 1)", and also the "SkipBadRecords.setAttemptsToStartSkipping(conf, 2)". However, after 3 failed attempts, it gave me this exception message: java.lang.NullPointerException at org.apache.hadoop.io.seriali
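For context, a sketch of the skip-mode settings quoted above, plus a related knob that is easy to miss: skipping only starts after the configured number of failed attempts, so the task must be allowed enough retries for skip mode to engage. The max-attempts value here is an assumption, not from the post.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkipBadRecordsSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);    // as in the post
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2); // as in the post
        conf.setMaxMapAttempts(4);  // assumption: leave attempts for skip mode to run
      }
    }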

HADOOP Oracle connection workaround

2009-03-12 Thread evana
The out-of-the-box Hadoop implementation has some issues connecting to Oracle. Looks like DBInputFormat is built keeping MySQL/HSQLDB in mind. You need to modify the out-of-the-box implementation of the getSelectQuery method in DBInputFormat. WORKAROUND: here is the code snippet... (remember this works on
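The promised snippet is cut off above; as a rough illustration of the workaround being described, here is a hedged sketch of an Oracle-flavoured paging query built with ROWNUM, which is what getSelectQuery would need to emit instead of a MySQL/HSQLDB-style LIMIT/OFFSET clause. The class name, method name, and split parameters are placeholders, not the actual patch:

    // Hypothetical helper: ROWNUM-based paging for Oracle. 'start' and 'length'
    // would come from the DBInputFormat split; 'baseQuery' is the plain
    // SELECT ... FROM ... ORDER BY ... statement without any paging clause.
    public class OraclePagingSketch {
      static String oracleSelectQuery(String baseQuery, long start, long length) {
        return "SELECT * FROM (SELECT a.*, ROWNUM dbif_rno FROM ( " + baseQuery
            + " ) a WHERE ROWNUM <= " + (start + length)
            + " ) WHERE dbif_rno > " + start;
      }
    }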