Hi all,
I'd like to select N random records from a large amount of data using
Hadoop; I just wonder how I can achieve this. Currently my idea is to let
each mapper task select N / mapper_number records. Does anyone have such
experience?
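One standard way to get exactly N in total is per-mapper reservoir sampling. A minimal sketch (the class name and the "sample.size" config key are invented for illustration): each mapper keeps a uniform sample of N / mapper_number records and emits it in cleanup().

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReservoirSampleMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final List<String> reservoir = new ArrayList<String>();
  private final Random random = new Random();
  private long seen = 0;
  private int sampleSize; // N / mapper_number

  @Override
  protected void setup(Context context) {
    sampleSize = context.getConfiguration().getInt("sample.size", 100);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    seen++;
    if (reservoir.size() < sampleSize) {
      reservoir.add(value.toString()); // fill the reservoir first
    } else {
      long slot = (long) (random.nextDouble() * seen);
      if (slot < sampleSize) { // replace with probability sampleSize/seen
        reservoir.set((int) slot, value.toString());
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    for (String record : reservoir) { // emit the surviving sample
      context.write(NullWritable.get(), new Text(record));
    }
  }
}

A single reducer can then trim the combined output down to exactly N if the mapper count isn't known in advance.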
--
Best Regards
Jeff Zhang
I believe you meant:
SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid
WHERE LOG2.recordid IS NULL
(this would produce the set of records that are in LOG1 but not in LOG2).
In Pig, we have to add an additional FILTER with an IS NULL condition.
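The same anti-join can also be written directly in MapReduce; a minimal sketch (class names, the tag format, and the assumption that both mappers emit recordid as the key are illustrative only):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Two mappers (not shown) tag each record with its source:
//   the LOG1 mapper emits (recordid, "1\t" + line)
//   the LOG2 mapper emits (recordid, "2")
public class AntiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> log1Records = new ArrayList<String>();
    boolean seenInLog2 = false;
    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("1\t")) {
        log1Records.add(v.substring(2)); // a LOG1 record for this id
      } else {
        seenInLog2 = true; // the id also appears in LOG2
      }
    }
    if (!seenInLog2) { // keep only LOG1 records absent from LOG2
      for (String record : log1Records) {
        context.write(new Text(record), NullWritable.get());
      }
    }
  }
}

The two mappers would typically be wired to their respective inputs with MultipleInputs.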
~Rajesh.B
On Mon, Jun 27,
We use hadoop/hdfs to archive data. I archive a lot of files by creating
one large tar file and then placing it in HDFS. Is it better to use a
Hadoop Archive (HAR) for this, or is it essentially the same thing?
--
--- Get your facts first, then you can distort them as you please.--
Saumitra,
Two questions come to mind that could help you narrow down a solution:
1) How quickly do the downstream processes need the transformed data?
Reason: If you can delay the processing for a period of time, enough to
batch the data into a blob that is a multiple of your block
On 24/06/11 18:16, Niels Boldt wrote:
Hi,
I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are
running on the same server. I'm writing to the Hadoop list, as it looks
like a problem related to Hadoop.
Some of my jobs partially fail, and in the error log I get output like
2011-06-24
On Sun, 26 Jun 2011 17:34:34 -0700, Mark static.void@gmail.com
wrote:
Hello all,
We have a recommendation system that reads in similarity data via a
Spring context.xml as follows:
<bean id="similarity"
      class="org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity">
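That (truncated) bean definition appears to correspond to constructing Mahout's FileItemSimilarity directly; roughly, with a placeholder path:

import java.io.File;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarityLoader {
  public static void main(String[] args) throws Exception {
    // Reads precomputed item-item similarities from a text file
    // (one "itemID1,itemID2,similarity" triple per line).
    ItemSimilarity similarity =
        new FileItemSimilarity(new File("/path/to/similarities.csv"));
  }
}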
Hi,
I have posted the question to Stack Overflow, where I have also
clarified my problem a bit.
If you have a solution, please respond there (if it's not too much of a
hassle):
I did something similar. Basically I had a random sampling algorithm that I
called from the mapper. If it returned true I would collect the data;
otherwise I would discard it.
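A sketch of that pattern (the class name and the "sample.fraction" config key are invented; in practice the fraction would come from the job configuration):

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SamplingMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {

  private final Random random = new Random();
  private float fraction; // e.g. 0.1f to keep roughly 10% of records

  @Override
  protected void setup(Context context) {
    fraction = context.getConfiguration().getFloat("sample.fraction", 0.1f);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (random.nextDouble() < fraction) {
      context.write(key, value); // collect this record
    }
    // otherwise discard it
  }
}

Note this yields approximately, not exactly, the requested fraction.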
Bill
-Original Message-
From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes
Sent: Monday,
If you are running default configurations then you are only getting 2 mappers
and 1 reducer per node. The rule of thumb I have gone on (and backed up by the
definitive guide) is 2 processes per core, so: tasktracker/datanode and 6 slots
left. How you break it up from there is your call, but I would suggest either
4 mappers / 2 reducers or 5 mappers / 1 reducer.
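For concreteness, those per-tasktracker slot counts are set in mapred-site.xml with the 0.20-era property names below (shown here as the 5 mapper / 1 reducer split; use 4/2 for the other suggestion):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>5</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>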
Wait - Habermaas like in Critical Theory
-Original Message-
From: Habermaas, William [mailto:william.haberm...@fatwire.com]
Sent: Monday, June 27, 2011 2:55 PM
To: common-user@hadoop.apache.org
Subject: RE: How to select random n records using mapreduce ?
I did something similar.
If the incoming data is unique you can create a hash of the data and then do
a modulus of the hash to select a random set. So if you wanted 10% of the
data randomly:
hash % 10 == 0
Gives a random 10%
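As a tiny Java helper, that check could look like this (a sketch; String.hashCode() is fine for illustration, though a stronger hash such as MD5 spreads more evenly):

// Deterministic ~10% sample: keep a record iff its hash falls in bucket 0.
// Requires unique records, as noted above: duplicate records always land
// in the same bucket together.
static boolean inSample(String record) {
  return Math.abs(record.hashCode() % 10) == 0;
}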
On 6/27/11 12:54 PM, Habermaas, William william.haberm...@fatwire.com
wrote:
I did
Hi Everyone:
I am quite new to Hadoop. I am attempting to set up Hadoop locally on
two machines, connected by LAN. Both of them pass the single-node test.
However, I failed in the two-node cluster setup, as shown in the 2 cases below:
1) set one as dedicated namenode and the other as dedicated
Did you make sure to define the datanode/tasktracker in the slaves file in your
conf directory and push that to both machines? Also have you checked the logs
on either to see if there are any errors?
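For reference, both files are just one hostname per line, something like this (hostnames hypothetical):

conf/slaves (datanode/tasktracker machines, read by start-dfs.sh/start-mapred.sh):
node-a
node-b

conf/masters (despite the name, this is where the secondary namenode starts):
node-a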
Matt
-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June
Hi,
I just manually modified the masters and slaves files on both machines.
I found something wrong in the log files, as shown below:
-- Master:
namenode.log:
2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
2011-06-27 13:44:47,394 INFO
http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html
-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June 27, 2011 3:58 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?
Hi,
I
As a follow-up to what Jeff posted: go ahead and ignore the message you got on
the NN for now.
If you look at the address that the DN log shows, it is 127.0.0.1, and the
ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it is
trying to bind to itself as if it were still in single-node
(pseudo-distributed) mode.
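The usual fix is to point the DN's config at the NN's real hostname instead of localhost (and to make sure /etc/hosts doesn't map that hostname to a loopback address). A sketch, using the NN address that appears later in this thread:

<!-- core-site.xml on the datanode (0.20-era property name) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://clock.ucsd.edu:54310</value>
</property>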
Matt,
Thanks for your help!
I think I get it now, but this part is a bit confusing:

"so: tasktracker/datanode and 6 slots left. How you break it up from there
is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers
/ 1 reducer."

If it's 2 processes per core, then it's:
Hi Matt and Jeff:
Thanks a lot for your instructions. I corrected the mistakes in the conf
files of the DN, and now the log on the DN becomes:
2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s).
2011-06-27
Yes, you can see a picture describing HAR files in this old blog post:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
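For comparison with the tar workflow described earlier, a HAR is built on the cluster with the archive tool; for example (paths hypothetical):

hadoop archive -archiveName logs.har -p /user/rita logs /user/rita/archived
hadoop fs -ls har:///user/rita/archived/logs.har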
-Joey
On Mon, Jun 27, 2011 at 4:36 PM, Rita rmorgan...@gmail.com wrote:
So, it does an index of the file?
On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria
Ok,
So I tried putting the following config in the mapred-site.xml of all of my
nodes
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
At this point, if that is the correct IP, then I would see if you can actually
ssh from the DN to the NN to make sure it can actually connect to the other
box. If you can successfully connect through ssh then it's just a matter of
figuring out why that port is having issues (netstat is your friend).
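For example, on the NN, using the port from the log earlier in the thread:

# Is anything listening on 54310, and on which interface? A listener bound
# to 127.0.0.1:54310 here would explain the DN's failure to connect.
netstat -an | grep 54310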