How to select random n records using mapreduce ?

2011-06-27 Thread Jeff Zhang
Hi all, I'd like to select random N records from a large amount of data using Hadoop; I just wonder how I can achieve this? Currently my idea is to let each mapper task select N / mapper_number records. Does anyone have such experience? -- Best Regards Jeff Zhang
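
A minimal sketch of the per-mapper idea above, assuming each mapper keeps a reservoir of quota = N / mapper_number records and emits it when its split is exhausted (the class name and the sample.per.mapper property are made up for illustration):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch: each mapper keeps a reservoir of quota = N / mapper_number lines.
    public class ReservoirSampleMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      private int quota;                 // this mapper's share of N
      private long seen = 0;             // records seen so far by this mapper
      private final List<String> reservoir = new ArrayList<String>();
      private final Random random = new Random();

      @Override
      protected void setup(Context context) {
        // "sample.per.mapper" is an assumed job property set by the driver.
        quota = context.getConfiguration().getInt("sample.per.mapper", 100);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        seen++;
        if (reservoir.size() < quota) {
          reservoir.add(value.toString());
        } else {
          // Replace an existing element with probability quota / seen.
          long idx = (long) (random.nextDouble() * seen);
          if (idx < quota) {
            reservoir.set((int) idx, value.toString());
          }
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        for (String record : reservoir) {
          context.write(NullWritable.get(), new Text(record));
        }
      }
    }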

Re: Comparing two logs, finding missing records

2011-06-27 Thread Rajesh Balamohan
I believe you meant: SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid WHERE LOG2.recordid IS NULL. (This would produce the set of records that are in LOG1 but not present in LOG2.) In Pig, we have to add an additional filter with an IS NULL condition. ~Rajesh.B On Mon, Jun 27,

tar or hadoop archive

2011-06-27 Thread Rita
We use Hadoop/HDFS to archive data. I archive a lot of files by creating one large tar file and then placing it in HDFS. Is it better to use Hadoop archive for this, or is it essentially the same thing? -- Get your facts first, then you can distort them as you please.

RE: Queue support from HDFS

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Saumitra, Two questions come to mind that could help you narrow down a solution: 1) How quickly do the downstream processes need the transformed data? Reason: If you can delay the processing for a period of time, enough to batch the data into a blob that is a multiple of your block

Re: error in reduce task

2011-06-27 Thread Steve Loughran
On 24/06/11 18:16, Niels Boldt wrote: Hi, I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running on the same server. I'm writing to the Hadoop list, as it looks like a problem related to Hadoop. Some of my jobs partially fail, and in the error log I get output like 2011-06-24

Re: Reading HDFS files via Spring

2011-06-27 Thread John Armstrong
On Sun, 26 Jun 2011 17:34:34 -0700, Mark <static.void@gmail.com> wrote: Hello all, We have a recommendation system that reads in similarity data via a Spring context.xml as follows: <bean id="similarity" class="org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity"

Re: Computing overlap of two files with hadoop

2011-06-27 Thread Claus Stadler
Hi, I have posted the question to Stack Overflow, where I have also clarified my problem a bit. If you have a solution, please respond there (if it's not too much of a hassle):

RE: How to select random n records using mapreduce ?

2011-06-27 Thread Habermaas, William
I did something similar. Basically I had a random sampling algorithm that I called from the mapper. If it returned true, I would collect the data; otherwise I would discard it. Bill -Original Message- From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes Sent: Monday,
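
A minimal sketch of this accept/reject approach, assuming the sampling check is simply a random draw against a configured fraction (the class name and the sample.fraction property are made up for illustration):

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch: keep a record only when the sampling check returns true.
    public class BernoulliSampleMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      private double fraction;
      private final Random random = new Random();

      @Override
      protected void setup(Context context) {
        // "sample.fraction" is an assumed job property, e.g. 0.01f for roughly 1% of records.
        fraction = context.getConfiguration().getFloat("sample.fraction", 0.01f);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (random.nextDouble() < fraction) {
          context.write(key, value);   // collect
        }
        // otherwise discard
      }
    }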

RE: Performance Tunning

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed up by the Definitive Guide) is 2 processes per core, so: tasktracker/datanode and 6 slots left. How you break it up from there is your call, but I would
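
For reference, per-node slot counts of this kind are set in mapred-site.xml on each worker node; a sketch of a 4 map / 2 reduce split (the values are illustrative, not a universal recommendation):

    <!-- mapred-site.xml on each worker node -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>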

RE: How to select random n records using mapreduce ?

2011-06-27 Thread Jeff.Schmitz
Wait - Habermaas like in Critical Theory -Original Message- From: Habermaas, William [mailto:william.haberm...@fatwire.com] Sent: Monday, June 27, 2011 2:55 PM To: common-user@hadoop.apache.org Subject: RE: How to select random n records using mapreduce ? I did something similar.

Re: How to select random n records using mapreduce ?

2011-06-27 Thread Matt Pouttu-Clarke
If the incoming data is unique, you can create a hash of the data and then take a modulus of the hash to select a random set. So if you wanted 10% of the data randomly: hash % 10 == 0 gives a random 10%. On 6/27/11 12:54 PM, Habermaas, William william.haberm...@fatwire.com wrote: I did
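
A minimal sketch of the hash-modulus idea, assuming the record text itself is hashed with String.hashCode() (a stronger hash would spread values more evenly; the class name is made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch: keep records whose hash falls into 1 of 10 buckets (~10% of unique records).
    public class HashSampleMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        int hash = value.toString().hashCode();
        // Mask off the sign bit so the modulus is always non-negative.
        if ((hash & Integer.MAX_VALUE) % 10 == 0) {
          context.write(key, value);
        }
      }
    }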

Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jingwei Lu
Hi Everyone: I am quite new to Hadoop. I am attempting to set up Hadoop locally on two machines connected by LAN. Both of them pass the single-node test. However, I failed in the two-node cluster setup, as shown in the 2 cases below: 1) set one as a dedicated namenode and the other as a dedicated

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Did you make sure to define the datanode/tasktracker in the slaves file in your conf directory and push that to both machines? Also have you checked the logs on either to see if there are any errors? Matt -Original Message- From: Jingwei Lu [mailto:j...@ucsd.edu] Sent: Monday, June

Re: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jingwei Lu
Hi, I just manually modified the masters and slaves files on both machines. I found something wrong in the log files, as shown below: -- Master : namenode.log: 2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14 2011-06-27 13:44:47,394 INFO

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jeff.Schmitz
http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html -Original Message- From: Jingwei Lu [mailto:j...@ucsd.edu] Sent: Monday, June 27, 2011 3:58 PM To: common-user@hadoop.apache.org Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup? Hi, I

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
As a follow-up to what Jeff posted: go ahead and ignore the message you got on the NN for now. If you look at the address the DN log shows, it is 127.0.0.1, and the ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it is trying to bind to itself as if it were still in
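
If that is the problem, one guess at the fix is to point the DN's core-site.xml at the master's hostname instead of localhost (the hostname and port below are taken from the logs later in this thread; whether this is the actual misconfiguration depends on the rest of the conf files):

    <!-- core-site.xml on the datanode: fs.default.name must name the master, not localhost -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://clock.ucsd.edu:54310</value>
    </property>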

Re: Performance Tunning

2011-06-27 Thread Juan P.
Matt, Thanks for your help! I think I get it now, but this part is a bit confusing: "so: tasktracker/datanode and 6 slots left. How you break it up from there is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer." If it's 2 processes per core, then it's:

Re: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jingwei Lu
Hi Matt and Jeff: Thanks a lot for your instructions. I corrected the mistakes in the conf files on the DN, and now the log on the DN shows: 2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s). 2011-06-27

Re: tar or hadoop archive

2011-06-27 Thread Joey Echeverria
Yes, you can see a picture describing HAR files in this old blog post: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ -Joey On Mon, Jun 27, 2011 at 4:36 PM, Rita rmorgan...@gmail.com wrote: So, it does an index of the file? On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria

Re: Performance Tunning

2011-06-27 Thread Juan P.
Ok, so I tried putting the following config in the mapred-site.xml of all of my nodes: <configuration> <property> <name>mapred.job.tracker</name> <value>name-node:54311</value> </property> <property> <name>mapred.map.tasks</name> <value>7</value> </property> <property>

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
At this point, if that is the correct IP, then I would see if you can actually ssh from the DN to the NN to make sure it can connect to the other box. If you can successfully connect through ssh, then it's just a matter of figuring out why that port is having issues (netstat is your