Re: How to install the MRunit

2013-04-18 Thread Harsh J
Hi Geelong, +CCing MRUnit's own user list. You can find examples of how to use MRUnit with your MR project here: https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial On Thu, Apr 18, 2013 at 11:17 AM, 姚吉龙 geelong...@gmail.com wrote: Hi everyone, how can I install MRUnit
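For reference, MRUnit is normally not installed on the cluster at all; it is pulled into the MR project as a test-scope dependency (the Maven artifact is org.apache.mrunit:mrunit, with a hadoop1 or hadoop2 classifier in recent releases, if I remember correctly) and driven from JUnit. A minimal sketch of such a test, assuming a hypothetical WordCount-style mapper:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // WordCountMapper is a hypothetical mapper under test
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
  }

  @Test
  public void emitsOneCountPerWord() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
             .withOutput(new Text("hadoop"), new IntWritable(1))
             .withOutput(new Text("hadoop"), new IntWritable(1))
             .runTest();
  }
}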

Re: Kerberos documentation

2013-04-18 Thread Christopher Vasanth John
Hi, I have a Kerberos KDC running and also have Apache Hadoop 1.0.4 running on a cluster. Is there some kind of documentation I can use to link the two? Basically, I'm trying to make my Hadoop cluster secure. Thanks, Chris On Wed, Apr 17, 2013 at 3:30 PM, Aaron T. Myers

Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
Hi all, Can I run multiple HDFS instances, that is, n separate namenodes and n datanodes, on a single machine? I've modified core-site.xml and hdfs-site.xml to avoid port and file conflicts between the HDFS instances, but when I started the second HDFS, I got the errors: Starting namenodes on [localhost]

Re: Run multiple HDFS instances

2013-04-18 Thread Harsh J
Yes you can, but if you want the scripts to work, you should have them use a different PID directory (I think it's called HADOOP_PID_DIR) every time you invoke them. I instead prefer to start the daemons up via their direct commands such as 'hdfs namenode' and so on, and move them to the background, with

Cartesian product in hadoop

2013-04-18 Thread zheyi rong
Dear all, I am writing to kindly ask for ideas on doing a cartesian product in Hadoop. Specifically, I now have two datasets, each of which contains 20 million lines. I want to do a cartesian product on these two datasets, comparing lines pairwise. The output of each comparison can be mostly

Oracle big data appliance

2013-04-18 Thread oualid ait wafli
Hi, has someone already worked with the Oracle Big Data Appliance? Thanks

Re: Cartesian product in hadoop

2013-04-18 Thread Jagat Singh
Hi, Can you have a look at http://pig.apache.org/docs/r0.11.1/basic.html#cross Thanks On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong zheyi.r...@gmail.com wrote: Dear all, I am writing to kindly ask for ideas of doing cartesian product in hadoop. Specifically, now I have two datasets, each

Re: Cartesian product in hadoop

2013-04-18 Thread Ajay Srivastava
Hi Rong, You can use the following simple method. Let's say dataset1 has m records; when you emit these records from the mapper, the keys are K1, K2, ..., Km for each respective record. Also add an identifier to identify the dataset from which each record is being emitted. So if R1 is a record in dataset1, the

Configuration clone constructor not cloning classloader

2013-04-18 Thread Amit Sela
Hi all, I was wondering if there is a good reason why the public Configuration(Configuration other) constructor in Hadoop 1.0.4 doesn't copy the classloader of 'other' into the new Configuration? Is this a bug? I'm asking because I'm trying to run a Hadoop client in an OSGi environment and I need to
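In case it helps while that question is open, a small workaround sketch (just a suggestion, not an official recommendation): Configuration exposes getClassLoader()/setClassLoader(), so the classloader can be carried over by hand after cloning.

import org.apache.hadoop.conf.Configuration;

public final class ConfUtil {
  // Clone a Configuration and manually carry over its classloader,
  // since the copy constructor does not appear to do that in 1.0.4.
  public static Configuration cloneWithClassLoader(Configuration other) {
    Configuration copy = new Configuration(other);
    copy.setClassLoader(other.getClassLoader());
    return copy;
  }
}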

Re: Cartesian product in hadoop

2013-04-18 Thread zheyi rong
Hi Ajay Srivastava, Thank you for your reply. Could you please explain a little bit more about "Write a grouping comparator which groups records on the first part of the key, i.e. Ki"? I guess it is a crucial part, which could filter some pairs before passing them to the reducer. Regards, Zheyi Rong On

Re: Cartesian product in hadoop

2013-04-18 Thread zheyi rong
Thank you, now I get your point. But I wonder whether this approach would be slower than implementing a custom InputFormat which, each time, provides a pair of lines to the mappers, and then doing the product in the mappers, since your approach would need (m + n + m*n) I/O on the mapper side, and (2*m*n) I/O

setting hdfs balancer bandwidth doesn't work

2013-04-18 Thread zhoushuaifeng
Hi, I set the HDFS balancer bandwidth from 1048576 to 104857600, but it doesn't work, what's wrong? Does anyone encounter the same problem? Thanks a lot. <property> <name>dfs.balance.bandwidthPerSec</name> <value>104857600</value> <description> Specifies the maximum amount of bandwidth that

Re: Cartesian product in hadoop

2013-04-18 Thread Ajay Srivastava
The approach which I proposed will have m+n I/O for reading the datasets, not (m + n + m*n), but the further I/O due to spills and to the reducer reading the mapper output will be more, as the number of tuples coming out of the mapper is (m + m*n). Regards, Ajay Srivastava On 18-Apr-2013, at 5:40 PM,

Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works! Everything looks fine now. Seems the direct command 'hdfs namenode' gives a better sense of control :) Thanks a lot. On Thursday, April 18, 2013, Harsh J wrote: Yes you can but if you want the scripts to work, you should have them use a

Re: will an application with two maps but no reduce be suitable for hadoop?

2013-04-18 Thread Roman Shaposhnik
On Thu, Apr 18, 2013 at 4:49 AM, Hadoop Explorer hadoopexplo...@outlook.com wrote: I have an application that evaluates a graph using this algorithm: - use a parallel for loop to evaluate all nodes in a graph (to evaluate a node, an image is read, and then the result of this node is calculated) -
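As a side note, "no reduce" in MapReduce terms is simply a map-only job. A rough driver sketch using the 1.x-style Job constructor (the mapper here is a hypothetical stand-in for the per-node evaluation; with zero reducers there is no shuffle and the mapper output goes straight to HDFS):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EvaluateNodesDriver {

  // Hypothetical per-node work: in the real job this would load the image for
  // the node named on the input line and compute that node's result.
  public static class EvaluateNodeMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text nodeLine, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("result-for-" + nodeLine.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "evaluate-graph-nodes");
    job.setJarByClass(EvaluateNodesDriver.class);
    job.setMapperClass(EvaluateNodeMapper.class);
    job.setNumReduceTasks(0);          // map-only: no shuffle, no reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}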

Re: Run multiple HDFS instances

2013-04-18 Thread Chris Embree
Glad you got this working... can you explain your use case a little? I'm trying to understand why you might want to do that. On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao aolixi...@gmail.com wrote: I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works! Everything looks

why multiple checkpoint nodes?

2013-04-18 Thread Thanh Do
Hi all, The document says "Multiple checkpoint nodes may be specified in the cluster configuration file." Can someone clarify why we really need to run multiple checkpoint nodes anyway? Is it possible that while checkpoint node A is doing a checkpoint, checkpoint node B kicks in and

Re: setting hdfs balancer bandwidth doesn't work

2013-04-18 Thread Thanh Do
What do you mean by "doesn't work"? On Thu, Apr 18, 2013 at 10:01 AM, zhoushuaifeng zhoushuaif...@gmail.com wrote: Hi, I set the HDFS balancer bandwidth from 1048576 to 104857600, but it doesn't work, what's wrong? Does anyone encounter the same problem? Thanks a lot. <property>

Re: Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
Actually I'm trying to do something like combining multiple namenodes so that they present themselves to clients as a single namespace, implementing basic namenode functionality. On Thursday, April 18, 2013, Chris Embree wrote: Glad you got this working... can you explain your use case a little? I'm

Re: why multiple checkpoint nodes?

2013-04-18 Thread shashwat shriparv
More checkpoint nodes means more backups of the metadata :) Thanks & Regards, Shashwat Shriparv On Thu, Apr 18, 2013 at 9:35 PM, Thanh Do than...@cs.wisc.edu wrote: Hi all, The document says "Multiple checkpoint nodes may be specified in the cluster configuration file." Can some one

Re: why multiple checkpoint nodes?

2013-04-18 Thread Thanh Do
so reliability (to prevent metadata loss) is the main motivation for multiple checkpoint nodes? Does anybody use multiple checkpoint nodes in real life? Thanks On Thu, Apr 18, 2013 at 12:07 PM, shashwat shriparv dwivedishash...@gmail.com wrote: more checkpoint nodes means more backup of the

Re: Cartesian product in hadoop

2013-04-18 Thread Ted Dunning
It is rarely practical to do exhaustive comparisons on datasets of this size. The method used is to heuristically prune the cartesian product set and only examine pairs that have a high likelihood of being near. This can be done in many ways. Your suggestion of doing a map-side join is a
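One common way to do the kind of pruning Ted describes is "blocking": derive a cheap key from each line and only compare lines that share that key, so each reduce group holds a small candidate set instead of the full cross product. A rough sketch of the idea (the blockKey() rule is a made-up placeholder, dataset tagging is omitted for brevity, and it assumes each block fits in memory):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockingCross {

  // Placeholder blocking rule: the first token of the line. Any cheap key
  // that likely-matching lines share will do (a prefix, a minhash, ...).
  static String blockKey(String line) {
    int space = line.indexOf(' ');
    return space < 0 ? line : line.substring(0, space);
  }

  public static class BlockMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(blockKey(line.toString())), line);
    }
  }

  public static class PairReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      // Only lines falling into the same block are compared pairwise.
      List<String> block = new ArrayList<String>();
      for (Text t : lines) {
        block.add(t.toString());
      }
      for (int i = 0; i < block.size(); i++) {
        for (int j = i + 1; j < block.size(); j++) {
          ctx.write(new Text(block.get(i)), new Text(block.get(j)));
        }
      }
    }
  }
}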

How to process only input files containing 100% valid rows

2013-04-18 Thread Matthias Scherer
Hi all, In my mapreduce job, I would like to process only whole input files containing only valid rows. If one map task processing an input split of a file detects an invalid row, the whole file should be marked as invalid and not processed at all. This input file will then be cleansed by

Re: why multiple checkpoint nodes?

2013-04-18 Thread Bertrand Dechoux
It would be important to point to the document (which I believe is http://hadoop.apache.org/docs/stable/hdfs_user_guide.html) and the version of Hadoop you are interested in. At one time, the documentation was misleading. The 1.x version didn't have checkpoint/backup nodes, only the secondary

Re: Problem: org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out

2013-04-18 Thread Som Satpathy
Never mind, got it fixed. Thanks, Som On Tue, Apr 16, 2013 at 6:18 PM, Som Satpathy somsatpa...@gmail.com wrote: Hi All, I have just set up a CDH cluster on EC2 using cloudera manager 4.5. I have been trying to run a couple of mapreduce jobs as part of an oozie workflow but have been

Re: why multiple checkpoint nodes?

2013-04-18 Thread Bertrand Dechoux
For more information: https://issues.apache.org/jira/browse/HADOOP-7297 It has been corrected, but the stable documentation is still the 1.0.4 one (prior to the correction). See * http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html * http://hadoop.apache.org/docs/r1.1.1/hdfs_user_guide.html *

Re: why multiple checkpoint nodes?

2013-04-18 Thread Mohammad Tariq
Hello Thanh, Just to keep you updated, the checkpoint node might get deprecated. So, it's always better to use the secondary namenode. More on this can be found here: https://issues.apache.org/jira/browse/HDFS-2397 https://issues.apache.org/jira/browse/HDFS-4114 Warm Regards, Tariq

Re: How to process only input files containing 100% valid rows

2013-04-18 Thread Steve Lewis
With files that small it is much better to write a custom input format which checks the entire file and only passes records from good files. If you need Hadoop, you are probably processing a large number of these files, and an input format could easily read the entire file and handle it if it is as
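A rough sketch of that whole-file idea (not Steve's actual code; it assumes the files are small enough to buffer in the reader and uses a placeholder isValid() rule):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ValidFileInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one reader per file, so it can see every row
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<LongWritable, Text>() {
      private final List<String> rows = new ArrayList<String>();
      private int pos = -1;

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c) throws IOException {
        Path path = ((FileSplit) s).getPath();
        FileSystem fs = path.getFileSystem(c.getConfiguration());
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            if (!isValid(line)) { // one bad row invalidates the whole file
              rows.clear();
              return;
            }
            rows.add(line);
          }
        } finally {
          in.close();
        }
      }

      @Override public boolean nextKeyValue() { return ++pos < rows.size(); }
      @Override public LongWritable getCurrentKey() { return new LongWritable(pos); }
      @Override public Text getCurrentValue() { return new Text(rows.get(pos)); }
      @Override public float getProgress() { return rows.isEmpty() ? 1.0f : (float) pos / rows.size(); }
      @Override public void close() { }
    };
  }

  // Placeholder validation rule; replace with the real row check.
  private static boolean isValid(String row) {
    return !row.trim().isEmpty();
  }
}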

Re: why multiple checkpoint nodes?

2013-04-18 Thread Thanh Do
Thanks guys for the updates! Yeah, I read the thread saying that the Checkpoint/BackupNode may get deprecated. SNN is the way to go then. I just wonder whether, if we use multiple CheckpointNodes, we might run into the situation where a checkpoint is ongoing, but the first CheckpointNode is slow, then the

Re: How to configure mapreduce archive size?

2013-04-18 Thread Hemanth Yamijala
Well, since the DistributedCache is used by the tasktracker, you need to update the log4j configuration file used by the tasktracker daemon. And you need to get the tasktracker log file - from the machine where you see the distributed cache problem. On Fri, Apr 19, 2013 at 6:27 AM,

Re: Run multiple HDFS instances

2013-04-18 Thread Hemanth Yamijala
Are you trying to implement something like namespace federation, that's a part of Hadoop 2.0 - http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao aolixi...@gmail.com wrote: Actually I'm trying to do something

Re: Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
Not really, federation provides separate namespaces, but I want it to look like one namespace. My basic idea is to maintain a map from files to namenodes; it receives RPC calls from the client and forwards them to the specific namenode in charge of the file. It's challenging for me but I'll figure out
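For what it's worth, a toy sketch of that routing table (purely illustrative; a real proxy would also have to implement the namenode's client RPC interface, handle renames across namenodes, and so on):

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy routing: longest-prefix match from an HDFS path to the namenode that
// "owns" that part of the namespace. A proxy in front of the namenodes could
// use something like this to decide where to forward each client call.
public class NamenodeRouter {
  private final Map<String, URI> mounts = new HashMap<String, URI>();

  public void addMount(String pathPrefix, URI namenode) {
    mounts.put(pathPrefix, namenode);
  }

  public URI route(String path) {
    String p = path;
    while (true) {
      URI owner = mounts.get(p);
      if (owner != null) {
        return owner;
      }
      int slash = p.lastIndexOf('/');
      if (slash <= 0) {
        return mounts.get("/"); // fall back to the default namenode, if any
      }
      p = p.substring(0, slash); // walk up one path component and retry
    }
  }

  public static void main(String[] args) {
    NamenodeRouter router = new NamenodeRouter();
    router.addMount("/", URI.create("hdfs://nn0:8020"));
    router.addMount("/logs", URI.create("hdfs://nn1:8020"));
    System.out.println(router.route("/logs/2013/04/18/part-00000")); // hdfs://nn1:8020
    System.out.println(router.route("/user/alice"));                 // hdfs://nn0:8020
  }
}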

Best way to collect Hadoop logs across cluster

2013-04-18 Thread Mark Kerzner
Hi, my clusters are on EC2, and they disappear after the cluster's instances are destroyed. What is the best practice to collect the logs for later storage? EC2 does exactly that with their EMR, how do they do it? Thank you, Mark

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Marcos Luis Ortiz Valmaseda
When you destroy an EC2 instance, the correct behavior is to erase all its data. Why don't you create a service to collect the logs directly to an S3 bucket, in real time or in batches of 5 minutes? 2013/4/18 Mark Kerzner mark.kerz...@shmsoft.com Hi, my clusters are on EC2, and they disappear after

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Mark Kerzner
So you are saying the problem is very simple: just before you destroy the cluster, simply collect the logs to S3. Anyway, I only need them after I have completed a specific computation, so I don't have any special requirements. In regular permanent clusters, is there something that allows

Re: setting hdfs balancer bandwidth doesn't work

2013-04-18 Thread 周帅锋
I set bandwidthPerSec = 104857600, but when I add a new datanode and run hadoop balancer, the bandwidth is only 1MB/s, and the datanode log shows: org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s My Hadoop core version is 1.0.3 Thanks 2013/4/19

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Marcos Luis Ortiz Valmaseda
Actually the problem is not simple. Based on these problems, there are three companies working on them: - Loggly: http://loggly.com/ Loggly is a part of the Amazon Marketplace: https://aws.amazon.com/solution-providers/isv/loggly - Papertrail: https://papertrailapp.com/ How to do it: