Hi Geelong,
+CC'ing MRUnit's own user list.
You can find examples of how to use MRUnit with your MR project here:
https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial
On Thu, Apr 18, 2013 at 11:17 AM, 姚吉龙 geelong...@gmail.com wrote:
Hi everyone,
How can I install MRUnit?
Hi,
I have a Kerberos KDC running and also have Apache Hadoop 1.0.4
running on a cluster. Is there some kind of documentation I can use to link
the two?
Basically, I'm trying to make my Hadoop cluster secure.
Thanks,
Chris
On Wed, Apr 17, 2013 at 3:30 PM, Aaron T. Myers
Hi all,
Can I run multiple HDFS instances, that is, n separate namenodes and n
datanodes, on a single machine?
I've modified core-site.xml and hdfs-site.xml to avoid port and file
conflicts between the HDFS instances, but when I started the second HDFS, I got
these errors:
Starting namenodes on [localhost]
Yes you can, but if you want the scripts to work, you should have them
use a different PID directory (I think it's called HADOOP_PID_DIR)
every time you invoke them.
I instead prefer to start the daemons up via their direct commands, such
as hdfs namenode and so on, and move them to the background, with
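To sketch that suggestion (a rough Python illustration, not a real launcher; the helper names and directory layout here are made up):

```python
import os

def daemon_env(instance_dir):
    """Give each HDFS instance its own HADOOP_PID_DIR so the PID files
    of different instances on the same machine don't collide."""
    env = dict(os.environ)
    env["HADOOP_PID_DIR"] = os.path.join(instance_dir, "pids")
    return env

def daemon_command(daemon):
    """Run the daemon directly (e.g. 'hdfs namenode') instead of going
    through the sbin scripts; the caller then backgrounds the process."""
    return ["hdfs", daemon]
```

Each instance directory then has its own PID files, which is what lets the scripts (or a direct launcher) tell the instances apart.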
Dear all,
I am writing to kindly ask for ideas for doing a cartesian product in Hadoop.
Specifically, I now have two datasets, each of which contains 20 million
lines.
I want to do a cartesian product on these two datasets, comparing lines
pairwise.
The output of each comparison can be mostly
Hi
Has anyone already worked with the Oracle Big Data Appliance?
thanks
Hi,
Can you have a look at
http://pig.apache.org/docs/r0.11.1/basic.html#cross
Thanks
On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong zheyi.r...@gmail.com wrote:
Dear all,
I am writing to kindly ask for ideas of doing cartesian product in hadoop.
Specifically, now I have two datasets, each
Hi Rong,
You can use the following simple method.
Let's say dataset1 has m records, and when you emit these records from the mapper,
the keys are K1, K2, …, Km for each respective record. Also add an identifier to
identify the dataset from which each record is being emitted.
So if R1 is a record in dataset1, the
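The message is cut off here, but a plausible reading of the method, consistent with the (m + m*n) map-output count mentioned later in the thread, is that each dataset2 record is replicated under every key Ki. A toy Python simulation (the tags and structure are my guess, not Hadoop code):

```python
from collections import defaultdict

def map_phase(dataset1, dataset2):
    """Emit each dataset1 record once under its own key Ki, tagged D1,
    and replicate every dataset2 record under all m keys, tagged D2."""
    out = []
    for i, r in enumerate(dataset1):
        out.append((i, ("D1", r)))
    for i in range(len(dataset1)):
        for s in dataset2:
            out.append((i, ("D2", s)))
    return out

def reduce_phase(mapped):
    """Group by key; each group holds one D1 record and all D2 records,
    so pairing them inside the group yields the cartesian product."""
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    pairs = []
    for values in groups.values():
        d1 = [v for tag, v in values if tag == "D1"]
        d2 = [v for tag, v in values if tag == "D2"]
        for r in d1:
            for s in d2:
                pairs.append((r, s))
    return pairs
```

With m records in dataset1 and n in dataset2, this emits m + m*n tuples from the map phase and produces all m*n pairs at the reducers.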
Hi all,
I was wondering if there is a good reason why the public
Configuration(Configuration other) constructor in Hadoop 1.0.4 doesn't
clone the classloader of other into the new Configuration?
Is this a bug?
I'm asking because I'm trying to run a Hadoop client in an OSGi environment
and I need to
Hi Ajay Srivastava,
Thank you for your reply.
Could you please explain a little bit more about "Write a grouping comparator
which groups records on the first part of the key, i.e. Ki"?
I guess it is a crucial part, which could filter some pairs before passing
them to the reducer.
Regards,
Zheyi Rong
On
Thank you, now I get your point.
But I wonder whether this approach would be slower than
implementing a custom InputFormat which, each time, provides a pair of
lines to the mappers, and then doing the product in the mappers? (in
Since your approach would need (m + n + m*n) I/O on the mapper side, and
(2*m*n) I/O
Hi,
I set the HDFS balancer bandwidth from 1048576 to 104857600, but it doesn't
seem to take effect. What's wrong?
Does anyone encounter the same problem?
Thanks a lot.
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>104857600</value>
  <description>
    Specifies the maximum amount of bandwidth that
The approach which I proposed will have (m + n) I/O for reading the datasets, not
(m + n + m*n); but the further I/O due to spills, and the reducer reading the
mapper output, will be more, as the number of tuples coming out of the mapper is (m + m*n).
Regards,
Ajay Srivastava
On 18-Apr-2013, at 5:40 PM,
I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works!
Everything looks fine now.
Seems the direct command hdfs namenode gives a better sense of control :)
Thanks a lot.
On Thursday, April 18, 2013, Harsh J wrote:
Yes you can but if you want the scripts to work, you should have them
use a
On Thu, Apr 18, 2013 at 4:49 AM, Hadoop Explorer
hadoopexplo...@outlook.com wrote:
I have an application that evaluates a graph using this algorithm:
- use a parallel for loop to evaluate all nodes in a graph (to evaluate a
node, an image is read, and then the result of that node is calculated)
-
Glad you got this working... can you explain your use case a little? I'm
trying to understand why you might want to do that.
On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao aolixi...@gmail.com wrote:
I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works!
Everything looks
Hi all,
The document says "Multiple checkpoint nodes may be specified in the
cluster configuration file."
Can someone clarify why we really need to run multiple checkpoint
nodes anyway? Is it possible that while checkpoint node A is doing a
checkpoint, checkpoint node B kicks in and
What do you mean by "doesn't work"?
On Thu, Apr 18, 2013 at 10:01 AM, zhoushuaifeng zhoushuaif...@gmail.com wrote:
Hi,
I set the hdfs balance bandwidth from 1048576 to 104857600, but it doesn't
work, what's wrong?
Does anyone encounter the same problem?
Thanks a lot.
property
Actually I'm trying to do something like combining multiple namenodes so
that they present themselves to clients as a single namespace, implementing
basic namenode functionalities.
On Thursday, April 18, 2013, Chris Embree wrote:
Glad you got this working... can you explain your use case a little? I'm
More checkpoint nodes mean more backups of the metadata :)
Thanks & Regards
∞
Shashwat Shriparv
On Thu, Apr 18, 2013 at 9:35 PM, Thanh Do than...@cs.wisc.edu wrote:
Hi all,
The document says Multiple checkpoint nodes may be specified in the
cluster configuration file.
Can some one
So reliability (preventing metadata loss) is the main motivation for
multiple checkpoint nodes?
Does anybody use multiple checkpoint nodes in real life?
Thanks
On Thu, Apr 18, 2013 at 12:07 PM, shashwat shriparv
dwivedishash...@gmail.com wrote:
more checkpoint nodes means more backup of the
It is rarely practical to do exhaustive comparisons on datasets of this
size.
The method used is to heuristically prune the cartesian product set and
only examine pairs that have a high likelihood of being near.
This can be done in many ways. Your suggestion of doing a map-side join is
a
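One common shape for such pruning is blocking (my example, not necessarily the join variant meant above): bucket records by a cheap key and compare only within matching buckets, so only likely-near pairs are ever examined.

```python
from collections import defaultdict

def candidate_pairs(dataset1, dataset2, blocking_key):
    """Instead of all |D1| x |D2| comparisons, bucket dataset2 by a cheap
    blocking key and only pair each dataset1 record with its bucket.
    blocking_key is a stand-in for any domain heuristic (a prefix, a
    hash band, a quantized feature, ...)."""
    buckets = defaultdict(list)
    for s in dataset2:
        buckets[blocking_key(s)].append(s)
    for r in dataset1:
        for s in buckets.get(blocking_key(r), []):
            yield (r, s)
```

The quality of the heuristic decides the recall/cost trade-off: a coarser key misses fewer true pairs but prunes less.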
Hi all,
In my MapReduce job, I would like to process only whole input files containing
only valid rows. If a map task processing an input split of a file detects an
invalid row, the whole file should be marked as invalid and not processed at
all. This input file will then be cleansed by
It would be important to point to the document (which I believe is
http://hadoop.apache.org/docs/stable/hdfs_user_guide.html) and the version
of Hadoop you are interested in. At one time, the documentation was
misleading. The 1.x versions didn't have checkpoint/backup nodes, only the
secondary
Never mind, got it fixed.
Thanks,
Som
On Tue, Apr 16, 2013 at 6:18 PM, Som Satpathy somsatpa...@gmail.com wrote:
Hi All,
I have just set up a CDH cluster on EC2 using Cloudera Manager 4.5. I have
been trying to run a couple of MapReduce jobs as part of an Oozie workflow,
but have been
For more information : https://issues.apache.org/jira/browse/HADOOP-7297
It has been corrected, but the stable documentation is still the 1.0.4
one (prior to the correction).
See
* http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html
* http://hadoop.apache.org/docs/r1.1.1/hdfs_user_guide.html
*
Hello Thanh,
Just to keep you updated: the checkpoint node might get deprecated, so
it's always better to use the secondary namenode. More on this can be found
here:
https://issues.apache.org/jira/browse/HDFS-2397
https://issues.apache.org/jira/browse/HDFS-4114
Warm Regards,
Tariq
With files that small it is much better to write a custom input format
which checks the entire file and only passes records from good files. If
you need Hadoop you are probably processing a large number of these files
and an input format could easily read the entire file and handle it if it
as as
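A Python sketch of that record-reader logic (the function and parameter names here are made up; in Hadoop this would live inside a custom InputFormat/RecordReader):

```python
def records_if_whole_file_valid(path, parse, is_valid):
    """Read the entire (small) file up front and emit its records only
    when every row passes validation; a single bad row rejects the whole
    file. parse and is_valid are placeholders for the real row logic."""
    with open(path) as f:
        rows = [parse(line.rstrip("\n")) for line in f]
    if all(is_valid(r) for r in rows):
        return rows
    return []  # whole file skipped; it can be routed to a cleansing step
```

This works precisely because the files are small enough to hold in memory; for large files you would validate in a first streaming pass instead.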
Thanks guys for the updates!
Yeah, I read the thread saying that the Checkpoint/BackupNode may get deprecated.
The SNN is the way to go then.
I just wonder, if we use multiple CheckpointNodes, whether we might run into the
situation where, while a checkpoint is ongoing, the first
CheckpointNode is slow, and then the
Well, since the DistributedCache is used by the tasktracker, you need to
update the log4j configuration file used by the tasktracker daemon. And you
need to get the tasktracker log file - from the machine where you see the
distributed cache problem.
On Fri, Apr 19, 2013 at 6:27 AM,
Are you trying to implement something like namespace federation? That's
part of Hadoop 2.0 -
http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html
On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao aolixi...@gmail.com wrote:
Actually I'm trying to do something
Not really; federation provides separate namespaces, but I want it to look
like one namespace. My basic idea is to maintain a map from files to
namenodes; it receives RPC calls from clients and forwards them to the
specific namenode in charge of the file. It's challenging for me, but
I'll figure it out
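A toy sketch of that routing idea in Python (all names invented; a real proxy would have to speak the HDFS RPC protocol and handle renames, listings, and failures):

```python
class NamenodeRouter:
    """Keep a map from file paths to backing namenodes and forward each
    client call to the namenode that owns the path."""
    def __init__(self):
        self.owner = {}      # path -> namenode id
        self.namenodes = {}  # namenode id -> handler callable

    def register(self, nn_id, handler):
        self.namenodes[nn_id] = handler

    def assign(self, path, nn_id):
        # Record which namenode is in charge of this path.
        self.owner[path] = nn_id

    def forward(self, path, op):
        # Look up the owning namenode and forward the operation to it.
        nn_id = self.owner[path]
        return self.namenodes[nn_id](op, path)
```

The hard parts the sketch skips are exactly the ones mentioned above: keeping the path map consistent across operations that span namenodes.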
Hi,
my clusters are on EC2, and their logs disappear after the cluster's instances
are destroyed. What is the best practice for collecting the logs for later
storage?
Amazon does exactly that with EMR; how do they do it?
Thank you,
Mark
When you destroy an EC2 instance, the correct behavior is to erase all
data.
Why don't you create a service to collect the logs directly to an S3 bucket,
in real time or in 5-minute batches?
2013/4/18 Mark Kerzner mark.kerz...@shmsoft.com
Hi,
my clusters are on EC2, and they disappear after
So you are saying the problem is very simple: just before you destroy the
cluster, simply collect the logs to S3. Anyway, I only need them after I
have completed a specific computation, so I don't have any special
requirements.
In regular permanent clusters, is there something that allows
I set bandwidthPerSec = 104857600, but when I add a new datanode and run
hadoop balancer, the bandwidth is only 1 MB/s, and the datanode log shows:
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
My hadoop core version is 1.0.3
Thanks
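For what it's worth, the reported 1048576 bytes/s is exactly the old 1 MB/s value, which suggests the datanodes never picked up the new setting; I believe in 1.x this property is read when the datanode starts, so the datanodes need a restart after changing it. A quick sanity check on the numbers:

```python
# Compare the two configured values against what the datanode log reports.
old_value = 1048576      # bytes/s, the value still shown in the log
new_value = 104857600    # bytes/s, the value set in hdfs-site.xml

assert old_value == 1 * 1024 * 1024     # exactly 1 MB/s
assert new_value == 100 * 1024 * 1024   # exactly 100 MB/s
print(new_value // old_value)           # intended speedup: 100x
```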
2013/4/19
Actually the problem is not simple. In fact, there are three companies
working on exactly these problems:
- Loggly:
http://loggly.com/
Loggly is a part of the Amazon Marketplace:
https://aws.amazon.com/solution-providers/isv/loggly
- Papertrail:
https://papertrailapp.com/
How to do it: