I have two files, A and D, each containing a (vectorId, vector) pair per line.
|D| = 100,000 and |A| = 1000. The dimensionality of the vectors is 100.
Now I want to execute the following:

for eachItem in A:
    for eachElem in D:
        dot_product = eachItem * eachElem
        save(dot_product)
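For reference, a minimal Hadoop mapper sketch for this nested loop, assuming the smaller file A (1000 vectors of 100 doubles) is copied to every node and cached in memory in setup(), while D is streamed as the map input. The class name, the tab/comma line format and the parse() helper are illustrative assumptions, not something specified in the thread.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DotProductMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    // In-memory copy of the smaller file A: vectorId -> vector.
    private final Map<String, double[]> a = new HashMap<String, double[]>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Assumes A was shipped to every node (e.g. via the distributed cache)
        // and is readable under the local name "A.txt"; format: "id<TAB>v1,v2,...".
        BufferedReader in = new BufferedReader(new FileReader("A.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");
            a.put(parts[0], parse(parts[1]));
        }
        in.close();
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is one line of D: "id<TAB>v1,v2,...".
        String[] parts = value.toString().split("\t");
        double[] d = parse(parts[1]);
        for (Map.Entry<String, double[]> e : a.entrySet()) {
            double dot = 0.0;
            for (int i = 0; i < d.length; i++) {
                dot += e.getValue()[i] * d[i];
            }
            // Key = "aId,dId", value = their dot product.
            context.write(new Text(e.getKey() + "," + parts[0]),
                          new DoubleWritable(dot));
        }
    }

    private static double[] parse(String csv) {
        String[] tok = csv.split(",");
        double[] v = new double[tok.length];
        for (int i = 0; i < tok.length; i++) {
            v[i] = Double.parseDouble(tok[i]);
        }
        return v;
    }
}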
What I tried
You could try out this piece of code before job.waitForCompletion():

FileSystem dfs = FileSystem.get(conf);
long fileSize = dfs.getFileStatus(new Path(hdfsFile)).getLen();
long maxSplitSize = fileSize / NUM_OF_MAP_TASKS; // in your case NUM_OF_MAP_TASKS = 4
conf.setLong("mapred.max.split.size", maxSplitSize);
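With the newer mapreduce API the same idea can also be expressed through the FileInputFormat helper instead of setting the property by name. A small sketch, where job is an org.apache.hadoop.mapreduce.Job and hdfsFile is the input path (both assumptions about the surrounding driver code):

// Sketch: cap the split size so roughly NUM_OF_MAP_TASKS splits are produced.
// The underlying property name differs between Hadoop versions; the helper hides that.
FileSystem dfs = FileSystem.get(job.getConfiguration());
long fileSize = dfs.getFileStatus(new Path(hdfsFile)).getLen();
FileInputFormat.setMaxInputSplitSize(job, fileSize / NUM_OF_MAP_TASKS);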
nodes in the cluster without obeying data locality. Of course, N=1 is an
extreme case; you could try changing the value of N based on the number of
lines in your input file.
- Rohit Kelkar
On Wed, Jan 18, 2012 at 2:31 AM, Yang wrote:
> I understand that normally map tasks are run close to
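Assuming the "N" above means lines per input split, for example via NLineInputFormat (a guess from the surrounding text, since the start of this message is cut off), the configuration might look roughly like this with the new mapreduce API; job and N are placeholders:

// Sketch: give each map task exactly N lines of the input file.
// Older releases configure the same thing through the
// "mapred.line.input.format.linespermap" property instead.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, N); // N = 1 gives one map task per input line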
Just trying to understand your use case: you need an hourly job to run on
data between 6:40 AM and 7:40 AM. Would it be like a moving window? For
example, run hourly jobs on
6:41 AM to 7:41 AM
6:42 AM to 7:42 AM
and so on...
On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthi wrote:
> Hi all,
>
> I have to imp
reducer code you could take the sqrt.
Your second assumption, that the number of reducers equals the number of
variables, is not right.
- Rohit Kelkar
On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin wrote:
> Hi,
>
> I have a spreadsheet where each column contains values for one
> variable. and I need to c
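Since the quoted question is cut off here, the exact computation is unclear, but as a guess at what "take the sqrt" in the reducer means, here is a minimal reducer sketch that treats the column (variable) index as the key, sums squared values emitted by the mapper, and takes the square root at the end. The key/value types and the assumption that the mapper emits squared values are mine, not from the thread.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: one reduce group per column index; the sqrt of the summed squares is written out.
public class ColumnSqrtReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

    @Override
    protected void reduce(IntWritable column, Iterable<DoubleWritable> squares,
                          Context context) throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable v : squares) {
            sum += v.get();
        }
        context.write(column, new DoubleWritable(Math.sqrt(sum)));
    }
}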
cluster summary. You will notice a number for the "reduce task capacity".
This is the maximum number of reducers that can run concurrently on your
cluster. No matter what value of N you specify, you will be limited by this
"reduce task capacity".
- Rohit Kelkar
On Tue, Apr 3, 20
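To make the distinction concrete: the number of reduce tasks you request and the cluster's reduce task capacity are independent. A one-line sketch, where job is an org.apache.hadoop.mapreduce.Job (an assumption about the surrounding code):

// Sketch: N reduce tasks are created in total, but the "reduce task capacity"
// shown in the cluster summary only caps how many of them run at the same
// time; any extra reduce tasks simply wait in the queue until slots free up.
job.setNumReduceTasks(N);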