Hi Shahab,

Sorry about the late reply; a personal matter came up and took up most of my time. Thank you for your replies.
The solution I chose was to temporarily transfer the metadata along with the data and then restore it on the reduce nodes. This works from a functional perspective as long as there are no strict performance requirements, and it will have to do for now. The permanent solution will likely involve tweaking Hadoop itself, but that is a different kettle of fish.
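
In case it helps anyone searching the archives later, the stopgap looks roughly like the sketch below. It is not the actual code (the real types are convoluted); MetadataCarryingKey and the Text placeholders are made up for illustration. The only point is the ordering in write()/readFields(): the metadata is written in front of the data on every record, so readFields() can restore it before it deserializes the data.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Stopgap key type: the (large) metadata is serialized in front of the
 * data on every record, so readFields() can rebuild it before reading
 * the data. Functionally correct, but the metadata is repeated per key.
 */
public class MetadataCarryingKey implements WritableComparable<MetadataCarryingKey> {

    // Both fields are placeholders for the real custom types.
    private final Text metadata = new Text();  // stands in for the complex metadata type
    private final Text data = new Text();      // stands in for the complex data type

    public MetadataCarryingKey() {}

    public MetadataCarryingKey(String metadata, String data) {
        this.metadata.set(metadata);
        this.data.set(data);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        metadata.write(out);  // metadata travels with every record...
        data.write(out);      // ...so it is always available on the reduce side
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        metadata.readFields(in);  // restore the metadata first
        data.readFields(in);      // then the data can be deserialized
    }

    @Override
    public int compareTo(MetadataCarryingKey other) {
        return data.compareTo(other.data);  // ordering depends only on the data
    }

    @Override
    public int hashCode() {
        return data.hashCode();  // keep partitioning consistent with compareTo/equals
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MetadataCarryingKey
                && data.equals(((MetadataCarryingKey) o).data);
    }
}

The obvious downside, and the reason it is only temporary, is that the (large) metadata is shipped with every key instead of once per distinct metadata value.
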
On Sun, Sep 1, 2013 at 12:48 AM, Shahab Yunus <[email protected]> wrote:

> Personally, I don't know a way to access job configuration parameters in custom implementations of Writables (at least not an elegant and appropriate one; of course, hacks of various kinds can be done). Maybe the experts can chime in?
>
> One idea I thought about was to use MapWritable (if you have not explored it already). You can encode the 'custom metadata' for your 'data' as one-byte symbols and move your data through the M/R flow as a map. Then, during deserialization, you will have the type (or your 'custom metadata') in the key part of the map, and the value would be your actual data. This aligns with the efficient approach that is used natively in Hadoop for Strings/Text, i.e. compact metadata (though I agree that you are not taking advantage of the other aspect, non-dependence between the metadata and the data it defines).
>
> Take a look at this:
> Page 96 of the Definitive Guide:
>
> http://books.google.com/books?id=Nff49D7vnJcC&pg=PA96&lpg=PA96&dq=mapwritable+in+hadoop&source=bl&ots=IiixYu7vXu&sig=4V6H7cY-MrNT7Rzs3WlODsDOoP4&hl=en&sa=X&ei=aX4iUp2YGoaosASs_YCACQ&sqi=2&ved=0CFUQ6AEwBA#v=onepage&q=mapwritable%20in%20hadoop&f=false
>
> and then this:
>
> http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html
>
> and add your own custom types here (note that you are restricted by the size of a byte):
>
> http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html
>
> Regards,
> Shahab
>
>
> On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER <[email protected]> wrote:
>
>> Thank you for your help, Shahab.
>>
>> I guess I wasn't being too clear. My logic is that I use a custom type as the key, and in order to deserialize it on the compute nodes, I need an extra piece of information (also a custom type).
>>
>> To use an analogy, a Text is serialized by writing the length of the string as a number and then the bytes that compose the actual string. When it is deserialized, the number tells the reader when to stop reading the string. This number varies from string to string and it is compact, so it makes sense to serialize it with the string.
>>
>> My use case is similar. I have a complex type (let's call this data), and in order to deserialize it, I need another complex type (let's call this second type metadata). The metadata is not closely tied to the data (i.e. if the data value changes, the metadata does not) and the metadata size is quite large.
>>
>> I ruled out a couple of options, but please let me know if you think I did so for the wrong reasons:
>> 1. I could serialize each data value with its own metadata value, but since the data value count is in the tens of millions or more and the distinct metadata value count can be up to one hundred, it would waste resources in the system.
>> 2. I could serialize the metadata and then the data as a collection property of the metadata. This would be an elegant solution code-wise, but then all the data would have to be read and kept in memory as a massive object before any reduce operations can happen. I wasn't able to find any info on this online, so this is just a guess from peeking at the Hadoop code.
>>
>> My "solution" was to serialize the data with a hash of the metadata, and to separately serialize the metadata and its hash in the job configuration (as key/value pairs). For this to work, I would need to be able to deserialize the metadata on the reduce node before the data is deserialized in the readFields() method.
>>
>> I think that for that to happen I need to hook into the code somewhere where a context or job configuration is available (before readFields()), but I'm stumped as to where that is.
>>
>> Cheers,
>> Adi
>>
>>
>> On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus <[email protected]> wrote:
>>
>>> What I meant was that you might have to split or redesign your logic or your use case (which we don't know about).
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER <[email protected]> wrote:
>>>
>>>> But how would the comparator have access to the job config?
>>>>
>>>>
>>>> On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <[email protected]> wrote:
>>>>
>>>>> I think you have to override/extend the Comparator to achieve that, something like what is done in Secondary Sort?
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <[email protected]> wrote:
>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> I apologise for the lack of code in this message, but the code is fairly convoluted and it would obscure my problem. That being said, I can put together some sample code if really needed.
>>>>>>
>>>>>> I am trying to pass some metadata between the map & reduce steps. This metadata is read and generated in the map step and stored in the job config. It also needs to be recreated on the reduce node before the key/value fields can be read in the readFields function.
>>>>>>
>>>>>> I had assumed that I would be able to override the Reducer.setup() function and that would be it, but apparently the readFields function is called before the Reducer.setup() function.
>>>>>>
>>>>>> My question is: what is the best (or any) place on the reduce node where I can access the job configuration/context before the readFields function is called?
>>>>>>
>>>>>> This is the stack trace:
>>>>>>
>>>>>>         at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>>>>         at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>>>>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>>>>>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
