cross product of two files using MapReduce - pls suggest

2011-01-19 Thread Rohit Kelkar
I have two files, A and D, containing (vectorId, vector) on each line. |D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100. Now I want to execute the following:

for eachItem in A:
    for eachElem in D:
        dot_product = eachItem * eachElem
        save(dot_product)

What I tried
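A minimal sketch of one common way to attack this shape of problem, assuming A is small enough (1000 vectors x 100 dimensions) to hold in memory on every node and stream D through the mappers; this is an illustrative pattern, not necessarily what the rest of the thread proposes. The class name AxDMapper, the file name "A.txt", and the tab/comma record layout are assumptions for the example.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AxDMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // vectorId -> vector for the small file A (fits comfortably in memory)
        private final Map<String, double[]> a = new HashMap<String, double[]>();

        @Override
        protected void setup(Context context) throws IOException {
            // Assumes A was shipped to every node (e.g. via the distributed cache) as "A.txt",
            // with lines of the form: vectorId <TAB> v1,v2,...,v100
            BufferedReader in = new BufferedReader(new FileReader("A.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                a.put(parts[0], parseVector(parts[1]));
            }
            in.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each map input record is one line of the big file D, same format as A
            String[] parts = value.toString().split("\t");
            String dId = parts[0];
            double[] dVec = parseVector(parts[1]);
            for (Map.Entry<String, double[]> item : a.entrySet()) {
                double dot = 0.0;
                for (int i = 0; i < dVec.length; i++) {
                    dot += item.getValue()[i] * dVec[i];
                }
                context.write(new Text(item.getKey() + "," + dId), new DoubleWritable(dot));
            }
        }

        private static double[] parseVector(String s) {
            String[] tokens = s.split(",");
            double[] v = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                v[i] = Double.parseDouble(tokens[i]);
            }
            return v;
        }
    }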

Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-19 Thread Rohit Kelkar
you could try out this piece of code before job.waitForCompletion():

    FileSystem dfs = FileSystem.get(conf);
    long fileSize = dfs.getFileStatus(new Path(hdfsFile)).getLen();
    long maxSplitSize = fileSize / NUM_OF_MAP_TASKS; // in your case NUM_OF_MAP_TASKS = 4
    conf.setLong("mapred.max.split.size", maxSplitSize);

Re: controlling where a map task runs?

2012-01-17 Thread Rohit Kelkar
nodes in the cluster without obeying data locality. Of course N=1 is an extreme case; you could try changing the value of N based on the number of lines in your input file. - Rohit Kelkar On Wed, Jan 18, 2012 at 2:31 AM, Yang wrote: > I understand that normally map tasks are run close to
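A minimal sketch of one way to get N lines per map task, assuming the N discussed above refers to NLineInputFormat's lines-per-split setting (the truncated reply does not name the mechanism); the helper class and method names are illustrative only.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLinesDriver {
        public static void configure(Job job, int n) {
            // Each map task receives exactly n lines of the input file;
            // n = 1 is the extreme case mentioned above (one map task per line).
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, n);
        }
    }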

Re: Query Regarding design MR job for Billing

2012-02-27 Thread Rohit Kelkar
Just trying to understand your use case: you need an hourly job to run on data between 6:40 AM and 7:40 AM. Would it be like a moving window? For example, run hourly jobs on 6:41 AM to 7:41 AM, 6:42 AM to 7:42 AM, and so on... On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthi wrote: > Hi all, > > I have to imp

Re: How I can create map/reduce for a spreadsheet calculation?

2012-04-02 Thread Rohit Kelkar
reducer code you could take the sqrt. Your second assumption that number of reducers = number of variables is not right. - Rohit Kelkar On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin wrote: > Hi, > > I have a spreadsheet where each column contains values for one > variable. and I need to c
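A minimal sketch of the map/reduce split being described, assuming the per-variable computation is a sum of squares followed by a square root in the reducer (the original question is truncated, so the exact formula is an assumption); the class names and the comma-separated input format are illustrative. Note that one reducer can serve many columns, which is why the number of reducers need not equal the number of variables.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ColumnSqrtSum {

        // One output pair per cell: key = column index, value = squared cell value
        public static class CellMapper
                extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text row, Context context)
                    throws IOException, InterruptedException {
                String[] cells = row.toString().split(",");  // assumes a CSV export of the spreadsheet
                for (int col = 0; col < cells.length; col++) {
                    double v = Double.parseDouble(cells[col].trim());
                    context.write(new IntWritable(col), new DoubleWritable(v * v));
                }
            }
        }

        // One reduce call per column: sum the squares, then take the sqrt
        public static class ColumnReducer
                extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
            @Override
            protected void reduce(IntWritable col, Iterable<DoubleWritable> squares, Context context)
                    throws IOException, InterruptedException {
                double sum = 0.0;
                for (DoubleWritable s : squares) {
                    sum += s.get();
                }
                context.write(col, new DoubleWritable(Math.sqrt(sum)));
            }
        }
    }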

Re: How I can create map/reduce for a spreadsheet calculation?

2012-04-02 Thread Rohit Kelkar
cluster summary. You will notice a number for the "reduce task capacity". This is the max number of reducers that would run concurrently on your cluster. No matter what value of N you specify, you will be limited by the above "reduce task capacity". - Rohit Kelkar On Tue, Apr 3, 20
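A minimal sketch, assuming the N mentioned above is the reducer count requested through the Job API; the wrapper class is illustrative. All N reduce tasks are created for the job, but only "reduce task capacity" of them execute concurrently across the cluster at any moment.

    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCount {
        public static void setReducers(Job job, int n) {
            // Requests n reduce tasks for the job; the cluster's reduce task capacity
            // only limits how many of them run at the same time.
            job.setNumReduceTasks(n);
        }
    }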