Yes, I feel we could discuss this over a JIRA and remove it if it hurts performance too much, but it would have to be marked as an incompatible change, and we would have to add a note about the lack of thread safety to the javadoc of the base Mapper/Reducer classes.
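[Editorial aside: the pattern being debated above can be sketched in isolation. This is a hypothetical, simplified writer, not Hadoop's actual LineRecordWriter; the class and field names are invented for illustration. The `synchronized` keyword on `write` mirrors what LineRecordWriter does today, and is exactly what single-threaded tasks pay for on every call.]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified sketch of the pattern under discussion: a
// line-oriented record writer whose write() is synchronized, so every
// call acquires the monitor even when only one thread ever writes.
class SimpleLineWriter {
    private static final byte[] TAB = "\t".getBytes(StandardCharsets.UTF_8);
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    private final DataOutputStream out;

    SimpleLineWriter(DataOutputStream out) {
        this.out = out;
    }

    // "synchronized" here mirrors LineRecordWriter.write(..). Removing it
    // (the proposal in the thread) would shift responsibility for thread
    // safety onto callers, hence the suggested javadoc note.
    public synchronized void write(String key, String value) throws IOException {
        out.write(key.getBytes(StandardCharsets.UTF_8));
        out.write(TAB);
        out.write(value.getBytes(StandardCharsets.UTF_8));
        out.write(NEWLINE);
    }
}
```

With the keyword in place, the key/tab/value/newline sequence is emitted atomically per call; without it, two concurrent writers could interleave their bytes mid-line.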
On Sun, Aug 11, 2013 at 1:26 PM, Sathwik B P <[email protected]> wrote:
> Hi Harsh,
>
> Does it make any sense to keep the method in LRW still synchronized? Isn't
> it creating unnecessary overhead for non-multi-threaded implementations?
>
> regards,
> sathwik
>
>
> On Fri, Aug 9, 2013 at 7:16 AM, Harsh J <[email protected]> wrote:
>>
>> I suppose I should have been clearer. There's no problem out of the box if
>> people stick to the libraries we offer :)
>>
>> Yes, the LRW was marked synchronized at some point over 8 years ago [1]
>> in support of multi-threaded maps, but the framework has changed much
>> since then. The MultithreadedMapper/etc. API we offer now
>> automatically shields the devs from having to think about output
>> thread safety [2].
>>
>> I can imagine there can only be a problem if a user writes their own
>> unsafe multi-threaded task. I suppose we could document that in the
>> Mapper/MapRunner and Reducer APIs.
>>
>> [1] - http://svn.apache.org/viewvc?view=revision&revision=171186 -
>> Commit that added a synchronized to the write call.
>> [2] - MultithreadedMapper/etc. synchronize over the collector -
>> http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.java?view=markup
>>
>> On Thu, Aug 8, 2013 at 7:52 PM, Azuryy Yu <[email protected]> wrote:
>> > The sequence writer is also synchronized; I don't think this is bad.
>> >
>> > If you call the HDFS API to write concurrently, then it's necessary.
>> >
>> > On Aug 8, 2013 7:53 PM, "Jay Vyas" <[email protected]> wrote:
>> >>
>> >> Then is this a bug? Synchronization in the absence of any race
>> >> condition is normally considered "bad".
>> >>
>> >> In any case, I'd like to know why this writer is synchronized whereas
>> >> the other ones are not.
>> >> That is, I think, the point at issue: either the other
>> >> writers should be synchronized, or else this one shouldn't be.
>> >> Consistency across the write implementations is probably desirable,
>> >> so that changes to output formats or record writers don't lead to
>> >> bugs in multi-threaded environments.
>> >>
>> >> On Aug 8, 2013, at 6:50 AM, Harsh J <[email protected]> wrote:
>> >>
>> >> While we don't fork by default, we do provide a MultithreadedMapper
>> >> implementation that would require such synchronization. But if you are
>> >> asking whether it is necessary, then perhaps the answer is no.
>> >>
>> >> On Aug 8, 2013 3:43 PM, "Azuryy Yu" <[email protected]> wrote:
>> >>>
>> >>> It's not Hadoop-forked threads; we may create a line record writer
>> >>> and then call this writer concurrently.
>> >>>
>> >>> On Aug 8, 2013 4:00 PM, "Sathwik B P" <[email protected]> wrote:
>> >>>>
>> >>>> Hi,
>> >>>> Thanks for your reply.
>> >>>> May I know where Hadoop forks multiple threads to use a single
>> >>>> RecordWriter?
>> >>>>
>> >>>> regards,
>> >>>> sathwik
>> >>>>
>> >>>> On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu <[email protected]> wrote:
>> >>>>>
>> >>>>> Because we may use multiple threads to write a single file.
>> >>>>>
>> >>>>> On Aug 8, 2013 2:54 PM, "Sathwik B P" <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> LineRecordWriter.write(..) is synchronized. I did not find any
>> >>>>>> other RecordWriter implementations that define write as
>> >>>>>> synchronized. Any specific reason for this?
>> >>>>>>
>> >>>>>> regards,
>> >>>>>> sathwik
>>
>> --
>> Harsh J

--
Harsh J
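[Editorial aside: the alternative Harsh points to in [2] above is caller-side locking in the style of MultithreadedMapper, which synchronizes over a shared collector rather than relying on a synchronized write() inside the writer. A minimal sketch of that pattern follows; the classes here are invented stand-ins, not Hadoop APIs, and a plain ArrayList plays the role of the unsynchronized writer.]

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a record writer with NO internal synchronization; an
// ArrayList is deliberately not thread-safe on its own.
class UnsynchronizedWriter {
    final List<String> lines = new ArrayList<>();

    void write(String key, String value) {
        lines.add(key + "\t" + value);
    }
}

class MultithreadedEmit {
    // Several worker threads share one writer, but each emit happens
    // inside a block synchronized on the writer object, in the style of
    // MultithreadedMapper's synchronized collector wrapper.
    static void emitConcurrently(UnsynchronizedWriter writer,
                                 int threads, int perThread)
            throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    synchronized (writer) {   // external locking by the caller
                        writer.write("k" + id, "v" + i);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

The design trade-off debated in the thread is exactly this: if the framework's multi-threaded wrappers already lock externally, a `synchronized` write() inside the writer only adds redundant monitor acquisition for the common single-threaded case.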
