Hi Shahab,

Sorry about the late reply; a personal matter came up and took up most of my time. Thank you for your replies.
The solution I chose was to temporarily transfer the metadata along with the data and then restore it on the reduce nodes. This works from a functional perspective as long as there are no strict performance requirements, and it will have to do for now. The permanent solution will likely involve tweaking Hadoop itself, but that is a different kettle of fish.
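
In case it helps anyone searching the archives later, the stopgap looks roughly like the sketch below. It is not the actual code (the real types are convoluted); MetadataCarryingKey and the Text placeholders are made up for illustration. The only point is the ordering in write()/readFields(): the metadata is written in front of the data on every record, so readFields() can restore it before it deserializes the data.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Stopgap key type: the (large) metadata is serialized in front of the
 * data on every record, so readFields() can rebuild it before reading
 * the data. Functionally correct, but the metadata is repeated per key.
 */
public class MetadataCarryingKey implements WritableComparable<MetadataCarryingKey> {

    // Both fields are placeholders for the real custom types.
    private final Text metadata = new Text();  // stands in for the complex metadata type
    private final Text data = new Text();      // stands in for the complex data type

    public MetadataCarryingKey() {}

    public MetadataCarryingKey(String metadata, String data) {
        this.metadata.set(metadata);
        this.data.set(data);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        metadata.write(out);  // metadata travels with every record...
        data.write(out);      // ...so it is always available on the reduce side
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        metadata.readFields(in);  // restore the metadata first
        data.readFields(in);      // then the data can be deserialized
    }

    @Override
    public int compareTo(MetadataCarryingKey other) {
        return data.compareTo(other.data);  // ordering depends only on the data
    }

    @Override
    public int hashCode() {
        return data.hashCode();  // keep partitioning consistent with compareTo/equals
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MetadataCarryingKey
                && data.equals(((MetadataCarryingKey) o).data);
    }
}

The obvious downside, and the reason it is only temporary, is that the (large) metadata is shipped with every key instead of once per distinct metadata value.
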
On Sun, Sep 1, 2013 at 12:48 AM, Shahab Yunus <[email protected]> wrote:

> Personally, I don't know a way to access job configuration parameters in custom implementations of Writables (at least not an elegant and appropriate one; of course, hacks of various kinds can be done). Maybe the experts can chime in?
>
> One idea I thought about was to use MapWritable (if you have not explored it already). You can encode the 'custom metadata' for your 'data' as one-byte symbols and move your data through the M/R flow as a map. Then, during deserialization, you will have the type (or your 'custom metadata') in the key part of the map, and the value would be your actual data. This aligns with the efficient approach that is used natively in Hadoop for Strings/Text, i.e. compact metadata (though I agree that you are not taking advantage of the other aspect, non-dependence between the metadata and the data it defines).
>
> Take a look at this:
> Page 96 of the Definitive Guide:
>
> http://books.google.com/books?id=Nff49D7vnJcC&pg=PA96&lpg=PA96&dq=mapwritable+in+hadoop&source=bl&ots=IiixYu7vXu&sig=4V6H7cY-MrNT7Rzs3WlODsDOoP4&hl=en&sa=X&ei=aX4iUp2YGoaosASs_YCACQ&sqi=2&ved=0CFUQ6AEwBA#v=onepage&q=mapwritable%20in%20hadoop&f=false
>
> and then this:
>
> http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html
>
> and add your own custom types here (note that you are restricted by the size of a byte):
>
> http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html
>
> Regards,
> Shahab
>
>
> On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER <[email protected]> wrote:
>
>> Thank you for your help, Shahab.
>>
>> I guess I wasn't being too clear. My logic is that I use a custom type as the key, and in order to deserialize it on the compute nodes, I need an extra piece of information (also a custom type).
>>
>> To use an analogy, a Text is serialized by writing the length of the string as a number and then the bytes that compose the actual string. When it is deserialized, the number tells the reader when to stop reading the string. This number varies from string to string and it is compact, so it makes sense to serialize it with the string.
>>
>> My use case is similar. I have a complex type (let's call this data), and in order to deserialize it, I need another complex type (let's call this second type metadata). The metadata is not closely tied to the data (i.e. if the data value changes, the metadata does not) and the metadata size is quite large.
>>
>> I ruled out a couple of options, but please let me know if you think I did so for the wrong reasons:
>> 1. I could serialize each data value with its own metadata value, but since the data value count is in the tens of millions or more and the distinct metadata value count can be up to one hundred, it would waste resources in the system.
>> 2. I could serialize the metadata and then the data as a collection property of the metadata. This would be an elegant solution code-wise, but then all the data would have to be read and kept in memory as a massive object before any reduce operations can happen. I wasn't able to find any info on this online, so this is just a guess from peeking at the Hadoop code.
>>
>> My "solution" was to serialize the data with a hash of the metadata, and to separately serialize the metadata and its hash in the job configuration (as key/value pairs). For this to work, I would need to be able to deserialize the metadata on the reduce node before the data is deserialized in the readFields() method.
>>
>> I think that for that to happen I need to hook into the code somewhere where a context or job configuration is available (before readFields()), but I'm stumped as to where that is.
>>
>> Cheers,
>> Adi
>>
>>
>> On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus <[email protected]> wrote:
>>
>>> What I meant was that you might have to split or redesign your logic or your use case (which we don't know about).
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER <[email protected]> wrote:
>>>
>>>> But how would the comparator have access to the job config?
>>>>
>>>>
>>>> On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <[email protected]> wrote:
>>>>
>>>>> I think you have to override/extend the Comparator to achieve that, something like what is done in Secondary Sort?
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <[email protected]> wrote:
>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> I apologise for the lack of code in this message, but the code is fairly convoluted and it would obscure my problem. That being said, I can put together some sample code if really needed.
>>>>>>
>>>>>> I am trying to pass some metadata between the map & reduce steps. This metadata is read and generated in the map step and stored in the job config. It also needs to be recreated on the reduce node before the key/value fields can be read in the readFields function.
>>>>>>
>>>>>> I had assumed that I would be able to override the Reducer.setup() function and that would be it, but apparently the readFields function is called before the Reducer.setup() function.
>>>>>>
>>>>>> My question is: what is the best (or any) place on the reduce node where I can access the job configuration/context before the readFields function is called?
>>>>>>
>>>>>> This is the stack trace:
>>>>>>
>>>>>>         at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>>>>         at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>>>>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>>>>>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
