Re: Problem running job with large number of directories

Josh Wills Wed, 22 May 2013 17:29:33 -0700

Hey John,

I posted a patch here: https://issues.apache.org/jira/browse/CRUNCH-209


I created it against master, as I don't think there have been any changes
to the MR execution stuff in 0.6.0 we need to worry about, but if you can't
apply it, let me know and I'll find a way to backport it. I'm 50-50 on
whether this will fix the issue, so please let me know if this doesn't do
the trick.
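
On the magic string from your report: it looks like a slice of Base64 text, so here is a minimal sketch (my own helper, not anything from Crunch or Hadoop) that tries to decode it. A substring cut out of a longer Base64 value can start anywhere inside a 4-character group, so the hypothetical decode_fragment function below tries all four alignments and keeps whichever ones produce printable ASCII.

```python
import base64

# Fragment copied from the exception message in the original report.
FRAGMENT = "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI"


def decode_fragment(fragment: str) -> dict:
    """Decode a substring of Base64 text at every possible alignment.

    Returns a dict mapping the number of leading characters dropped (0-3)
    to the decoded text, for each alignment that yields printable ASCII.
    """
    results = {}
    for offset in range(4):
        chunk = fragment[offset:]
        # Keep only whole 4-character groups so no padding is required.
        chunk = chunk[: len(chunk) - len(chunk) % 4]
        try:
            text = base64.b64decode(chunk, validate=True).decode("ascii")
        except ValueError:
            # Covers binascii.Error and UnicodeDecodeError alike.
            continue
        if text.isprintable():
            results[offset] = text
    return results


if __name__ == "__main__":
    for offset, text in decode_fragment(FRAGMENT).items():
        print(offset, text)
```

Only the alignment that drops one leading character decodes cleanly, and the result contains "name":"value","type":"string", which appears to be part of an Avro schema in JSON form. That fits your guess that the exception involves the serialized crunch.inputs.dir value, though it does not by itself explain how the value ends up being read as a split class name.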

J


On Wed, May 22, 2013 at 4:42 PM, John Jensen <[email protected]> wrote:

>
>  Certainly. Appreciate it.
>
>  ------------------------------
> *From:* Josh Wills [[email protected]]
> *Sent:* Wednesday, May 22, 2013 4:38 PM
> *To:* [email protected]
> *Subject:* Re: Problem running job with large number of directories
>
>   Hey John,
>
>  I haven't hit that one before, but I have some hypotheses we could test
> if you're up for trying out some patches I write.
>
>  J
>
>
> On Wed, May 22, 2013 at 4:01 PM, John Jensen <[email protected]> wrote:
>
>>
>>  I have a curious problem when running a Crunch job on (Avro) files in a
>> fairly large set of directories (just slightly less than 100).
>> After running some fraction of the mappers they start failing with the
>> exception below. Things work fine with a smaller number of directories.
>>
>>  The magic 
>> 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
>> string shows up in the 'crunch.inputs.dir' entry in the job config, so I
>> assume it has something to do with deserializing that value, but reading
>> through the code I don't see any obvious way that could happen.
>>
>>  Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so
>> it would not surprise me if I'm running up against a Hadoop limit somewhere.
>>
>>  Has anybody else seen similar issues? (this is 0.5.0, btw).
>>
>>  -- John
>>
>>   java.io.IOException: Split class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>>      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
>>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>      at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>      at java.security.AccessController.doPrivileged(Native Method)
>>      at javax.security.auth.Subject.doAs(Subject.java:415)
>>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>      at org.apache.hadoop.mapred.Child.main(Child.java:262)
>> Caused by: java.lang.ClassNotFoundException: Class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>>      at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
>>      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
>>      ... 7 more
>>
>>
>
>
>  --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
