Re: Why In-memory Mapoutput is necessary in ReduceCopier

2013-03-11 Thread Ling Kun
Dear Ravi and all,

   Thank you very much for your kind reply.
   I am currently looking into whether the HTTP GET fetch could be
replaced by some other mechanism, but I have not yet come up with a better
idea.

   Thanks again.

yours,
Ling Kun


On Tue, Mar 12, 2013 at 12:58 AM, Ravi Prakash wrote:

> Hi Ling,
>
> Yes! It is because of performance concerns. We want to keep and merge map
> outputs in memory as much as we can. The amount of memory reserved for this
> purpose is configurable. Obviously, storing fetched map outputs on disk,
> reading them back from disk to merge them, and then writing the result back
> to disk is a lot more expensive than doing it all in memory.
>
> Please let us know if you find a case where there was an opportunity to
> keep the map output in memory but we instead shuffled it to disk.
>
> Thanks
> Ravi
>
>
>
>
> 
>  From: Ling Kun 
> To: mapreduce-dev@hadoop.apache.org
> Sent: Monday, March 11, 2013 5:27 AM
> Subject: Why In-memory Mapoutput is necessary in ReduceCopier
>
> Dear all,
>
>    I am focusing on the map output copier implementation. This part of
> the code fetches map outputs and merges them into a file that can be fed
> to the reduce function. I have the following questions.
>
> 1. All on-disk map output data is merged by the LocalFSMerger, while
> in-memory map outputs are merged by the InMemFSMergeThread. The
> InMemFSMergeThread also has a writer object which writes the merge result
> to outputPath (ReduceTask.java line 2843). It seems that after merging,
> both in-memory and on-disk map output data end up in the local file
> system. Why not just use local files for all map output data?
>
> 2. After fetching a fragment of a map output file over HTTP, some map
> outputs are kept in memory, while the rest are written directly to the
> reducer's local disk. Whether a map output is kept in memory is decided
> in MapOutputCopier.getMapOutput(), which calls
> ramManager.canFitInMemory(). Why not store all the data on disk?
>
> 3. According to the comments, Hadoop keeps a map output in memory only
> if: (a) the size of the (decompressed) data is less than 25% of the total
> in-memory filesystem, and (b) there is space available in the in-memory
> filesystem. Why? Is it because of performance?
>
>
>
> Thanks
>
> yours,
> Ling Kun
>
> --
> http://www.lingcc.com
>



-- 
http://www.lingcc.com
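
[For readers tracing the code: the admission test behind questions 2 and 3
above can be sketched as follows. This is a simplified illustration of the
check in the Hadoop 1.x ReduceTask's ShuffleRamManager, not the verbatim
source; field names are approximations.]

    // Simplified sketch (not verbatim source): a single map output is
    // shuffled into memory only if its decompressed size is below 25% of
    // the total in-memory budget.
    public class ShuffleRamManagerSketch {
      private static final float MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION = 0.25f;

      private final long maxSingleShuffleLimit; // per-output cap in bytes

      public ShuffleRamManagerSketch(long maxInMemBytes) {
        this.maxSingleShuffleLimit =
            (long) (maxInMemBytes * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
      }

      // Called from MapOutputCopier.getMapOutput() to decide memory vs. disk.
      public boolean canFitInMemory(long requestedSize) {
        return requestedSize < Integer.MAX_VALUE      // must fit in a byte[]
            && requestedSize < maxSingleShuffleLimit; // condition (a): 25% cap
      }
    }

[Condition (b), free space in the in-memory filesystem, is enforced
separately: the copier blocks in ramManager.reserve() until enough space is
released, so canFitInMemory() only applies the size cap.]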


[jira] [Created] (MAPREDUCE-5057) Datajoin Package for reduce side join (in contrib folder)

2013-03-11 Thread Vikas Jadhav (JIRA)
Vikas Jadhav created MAPREDUCE-5057:
---

 Summary: Datajoin Package for reduce side join (in contrib folder)
 Key: MAPREDUCE-5057
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5057
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/data-join
Affects Versions: 1.0.3
Reporter: Vikas Jadhav
Priority: Trivial


The DataJoin package contributed to Hadoop has a bug:
1) The MRJobConfig.MAP_INPUT_FILE configuration property is not present, so
   the input file name cannot be retrieved.
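
[A minimal sketch of the symptom, assuming a typical old-API mapper; the
class and field names here are hypothetical, not taken from the datajoin
source.]

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // Hypothetical old-API mapper base showing the two property names.
    public class TaggedJoinMapperSketch extends MapReduceBase {
      private String inputFile;

      @Override
      public void configure(JobConf job) {
        // New-API name ("mapreduce.map.input.file", MRJobConfig.MAP_INPUT_FILE):
        // not populated for old-API jobs on 1.0.3, so this returns null.
        String broken = job.get("mapreduce.map.input.file");

        // Deprecated old-API name: set per split by the framework, and
        // usable as a workaround to tag records by their source file.
        inputFile = job.get("map.input.file");
      }
    }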


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-5056) TestProcfsBasedProcessTree fails on Windows with Process-tree dump doesn't start with a proper header

2013-03-11 Thread Ivan Mitic (JIRA)
Ivan Mitic created MAPREDUCE-5056:
-

 Summary: TestProcfsBasedProcessTree fails on Windows with Process-tree dump doesn't start with a proper header
 Key: MAPREDUCE-5056
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5056
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Ivan Mitic
Assignee: Ivan Mitic


The test fails on the assertion below:

Running org.apache.hadoop.mapreduce.util.TestProcfsBasedProcessTree
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.266 sec <<< FAILURE!
testProcessTreeDump(org.apache.hadoop.mapreduce.util.TestProcfsBasedProcessTree)  Time elapsed: 0 sec  <<< FAILURE!
junit.framework.AssertionFailedError: Process-tree dump doesn't start with a proper header
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at org.apache.hadoop.mapreduce.util.TestProcfsBasedProcessTree.testProcessTreeDump(TestProcfsBasedProcessTree.java:564)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:243)
at junit.framework.TestSuite.run(TestSuite.java:238)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
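
[The failing check boils down to something like the sketch below; the exact
header columns and assertion style in the real test are assumptions here,
not copied from the source.]

    import static junit.framework.Assert.assertTrue;

    // Hedged sketch of the check at TestProcfsBasedProcessTree.java:564;
    // the header text shown is an assumption.
    static void checkHeader(String processTreeDump) {
      assertTrue("Process-tree dump doesn't start with a proper header",
          processTreeDump.startsWith("\t|- PID PPID PGRPID SESSID CMD_NAME"));
    }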

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Why In-memory Mapoutput is necessary in ReduceCopier

2013-03-11 Thread Ravi Prakash
Hi Ling,

Yes! It is because of performance concerns. We want to keep and merge map
outputs in memory as much as we can. The amount of memory reserved for this
purpose is configurable. Obviously, storing fetched map outputs on disk,
reading them back from disk to merge them, and then writing the result back
to disk is a lot more expensive than doing it all in memory.

Please let us know if you find a case where there was an opportunity to keep
the map output in memory but we instead shuffled it to disk.
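
[On branch-1, the configurability mentioned above maps to shuffle buffer
properties along these lines; the property names are from Hadoop 1.x and the
values shown are the usual defaults. Treat this as a sketch.]

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuningSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Fraction of the reducer's heap reserved for in-memory map outputs:
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
        // Fill level at which the in-memory merge is triggered:
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
      }
    }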

Thanks
Ravi





 From: Ling Kun 
To: mapreduce-dev@hadoop.apache.org 
Sent: Monday, March 11, 2013 5:27 AM
Subject: Why In-memory Mapoutput is necessary in ReduceCopier
 
Dear all,

     I am focusing on the map output copier implementation. This part of the
code fetches map outputs and merges them into a file that can be fed to the
reduce function. I have the following questions.

1. All on-disk map output data is merged by the LocalFSMerger, while
in-memory map outputs are merged by the InMemFSMergeThread. The
InMemFSMergeThread also has a writer object which writes the merge result to
outputPath (ReduceTask.java line 2843). It seems that after merging, both
in-memory and on-disk map output data end up in the local file system. Why
not just use local files for all map output data?

2. After fetching a fragment of a map output file over HTTP, some map
outputs are kept in memory, while the rest are written directly to the
reducer's local disk. Whether a map output is kept in memory is decided in
MapOutputCopier.getMapOutput(), which calls ramManager.canFitInMemory().
Why not store all the data on disk?

3. According to the comments, Hadoop keeps a map output in memory only if:
(a) the size of the (decompressed) data is less than 25% of the total
in-memory filesystem, and (b) there is space available in the in-memory
filesystem. Why? Is it because of performance?



Thanks

yours,
Ling Kun

-- 
http://www.lingcc.com