Even though my execution plan says that only one MapReduce job will be
used, the web UI shows two MR jobs for the Pig task. I am probably
missing something in the middle, because replicated joins should only
use one MR job, right?
Another thing I find weird: when I tried executing the FR join again, I
got a JavaHeapSpace problem in its second job, whereas before I got an
error saying something like Pig was expecting X bytes but was getting
X+Y bytes. I haven't been able to reproduce that earlier error; it
probably had something to do with the state of my environment at the
time.
I thought the error about Pig expecting X bytes and getting more than
expected was related to Pig seeing roughly a 4x expansion when loading
data from disk into memory. That is why I was asking how this check is
done: is it something like (available Java heap space > 4 x file size)?
Thanks again.
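To make explicit what I mean by that check, here is a minimal sketch of the heuristic I am assuming; the function name and the 4x expansion factor are my own guesses, not Pig's actual implementation:

```python
def fits_in_memory(file_size_bytes, heap_bytes, expansion_factor=4):
    """Rough feasibility check for the replicated side of an FR join.

    Assumes the in-memory representation is about `expansion_factor`
    times the on-disk size (the ~4x expansion mentioned above).
    Illustrative only -- not taken from the Pig source.
    """
    return heap_bytes > expansion_factor * file_size_bytes

# A 77MB replicated input against a 200MB heap: 4 * 77MB = 308MB > 200MB,
# so the join would not be expected to fit.
print(fits_in_memory(77 * 1024**2, 200 * 1024**2))  # -> False
```

If something like this is what Pig does internally, it would explain why a file that looks small on disk can still blow the heap.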
#-----------------------------------------------
# Logical Plan:
#-----------------------------------------------
Store 1-86 Schema: {proy_sR::sr_cde_sk: int,proy_cD::cd_dem_sk: int}
Type: Unknown
|
|---LOJoin 1-25 Schema: {proy_sR::sr_cde_sk: int,proy_cD::cd_dem_sk:
int} Type: bag
| |
| Project 1-23 Projections: [0] Overloaded: false FieldSchema:
sr_cde_sk: int Type: int
| Input: ForEach 1-18
| |
| Project 1-24 Projections: [0] Overloaded: false FieldSchema:
cd_dem_sk: int Type: int
| Input: ForEach 1-22
|
|---ForEach 1-18 Schema: {sr_cde_sk: int} Type: bag
| | |
| | Project 1-17 Projections: [0] Overloaded: false
FieldSchema: sr_cde_sk: int Type: int
| | Input: ForEach 1-66
| |
| |---ForEach 1-66 Schema: {sr_cde_sk: int} Type: bag
| | |
| | Cast 1-35 FieldSchema: sr_cde_sk: int Type: int
| | |
| | |---Project 1-34 Projections: [0] Overloaded: false
FieldSchema: sr_cde_sk: bytearray Type: bytearray
| | Input: Load 1-13
| |
| |---Load 1-13 Schema: {sr_cde_sk: bytearray} Type: bag
|
|---ForEach 1-22 Schema: {cd_dem_sk: int} Type: bag
| |
| Project 1-21 Projections: [0] Overloaded: false
FieldSchema: cd_dem_sk: int Type: int
| Input: ForEach 1-85
|
|---ForEach 1-85 Schema: {cd_dem_sk: int} Type: bag
| |
|         Cast 1-68 FieldSchema: cd_dem_sk: int Type: int
| |
| |---Project 1-67 Projections: [0] Overloaded: false
FieldSchema: cd_dem_sk: bytearray Type: bytearray
| Input: Load 1-14
|
|---Load 1-14 Schema: {cd_dem_sk: bytearray} Type: bag
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-107
|
|---FRJoin[tuple] - 1-101
| |
| Project[int][0] - 1-99
| |
| Project[int][0] - 1-100
|
|---New For Each(false)[bag] - 1-92
| | |
| | Project[int][0] - 1-91
| |
| |---New For Each(false)[bag] - 1-90
| | |
| | Cast[int] - 1-89
| | |
| | |---Project[bytearray][0] - 1-88
| |
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/sr.dat:PigStorage('|'))
- 1-87
|
|---New For Each(false)[bag] - 1-98
| |
| Project[int][0] - 1-97
|
|---New For Each(false)[bag] - 1-96
| |
| Cast[int] - 1-95
| |
| |---Project[bytearray][0] - 1-94
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/cd.dat:PigStorage('|'))
- 1-93
2011-04-29 23:04:54,727 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 2
2011-04-29 23:04:54,727 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 2
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-109
Map Plan
Store(hdfs://berlin.labbio:54310/tmp/temp1815576246/tmp379673501:org.apache.pig.builtin.BinStorage)
- 1-110
|
|---New For Each(false)[bag] - 1-98
| |
| Project[int][0] - 1-97
|
|---New For Each(false)[bag] - 1-96
| |
| Cast[int] - 1-95
| |
| |---Project[bytearray][0] - 1-94
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/cd.dat:PigStorage('|'))
- 1-93--------
Global sort: false
----------------
MapReduce node 1-108
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-107
|
|---FRJoin[tuple] - 1-101
| |
| Project[int][0] - 1-99
| |
| Project[int][0] - 1-100
|
|---New For Each(false)[bag] - 1-92
| |
| Project[int][0] - 1-91
|
|---New For Each(false)[bag] - 1-90
| |
| Cast[int] - 1-89
| |
| |---Project[bytearray][0] - 1-88
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/sr.dat:PigStorage('|'))
- 1-87--------
Global sort: false
----------------
2011/4/28 Daniel Dai <[email protected]>:
> There should be only one job. Thanks to Thejas for pointing it out.
>
> Daniel
>
>
> -----Original Message----- From: Daniel Dai
> Sent: Wednesday, April 27, 2011 7:18 PM
> To: [email protected]
> Cc: Renato Marroquín Mogrovejo ; [email protected]
> Subject: Re: Error Executing a Fragment Replicated Join
>
> Do you see the failure in the first job (sampling) or in the second
> job? Do you see the exception right after the job kicks off?
>
> If the replicated side is too large, you will probably see a "Java heap
> exception" rather than a job setup exception. This looks more like an
> environment issue. Check whether you can run a regular join, and
> whether you have another hadoop config file in your classpath.
>
> Daniel
>
>
> On 04/27/2011 05:26 PM, Renato Marroquín Mogrovejo wrote:
>>
>> Now that the Apache server is ok with me again, I can write back to
>> the list. I wrote to the Apache Infra team and they told me to write
>> messages just in plain text, disabling any html within the message
>> (not that I ever sent html but oh well), I guess that worked :)
>> Well, first thanks for answering. I am using pig 0.7 and my pig script
>> is as follows:
>>
>> {code}
>> sr = LOAD 'pigData/sr.dat' using PigStorage('|') AS
>> (sr_ret_date_sk:int, sr_ret_tim_sk:int, sr_ite_sk:int, sr_cus_sk:int,
>> sr_cde_sk:int, sr_hde_sk:int, sr_add_sk:int, sr_sto_sk:int,
>> sr_rea_sk:int, sr_tic_num:int, sr_ret_qua:int, sr_ret_amt:double,
>> sr_ret_tax:double, sr_ret_amt_inc_tax:double, sr_fee:double,
>> sr_ret_sh_cst:double, sr_ref_csh:double, sr_rev_cha:double,
>> sr_sto_cred:double, sr_net_lss:double);
>>
>> cd = LOAD 'pigData/cd.dat' using PigStorage('|') AS (cd_dem_sk:int,
>> cd_gnd:chararray, cd_mrt_sts:chararray, cd_edt_sts:chararray,
>> cd_pur_est:int, cd_cred_rtg:chararray, cd_dep_cnt:int,
>> cd_dep_emp_cnt:int, cd_dep_col_count:int);
>>
>> proy_sR = FOREACH sr GENERATE sr_cde_sk;
>> proy_cD = FOREACH cd GENERATE cd_dem_sk;
>>
>> join_sR_cD = JOIN proy_sR BY sr_cde_sk, proy_cD BY cd_dem_sk USING
>> 'replicated';
>>
>> STORE join_sR_cD INTO 'queryResults/query.11.sr.cd.5.1' using
>> PigStorage('|');
>> {/code}
>>
>> "cd" is the 77MB relation and "sr" the 32MB one. I had some other
>> similar queries in which the 32MB relation was being joined with
>> smaller relations (<10MB), giving the same problem. I modified those
>> so that the relations under 10MB would be the ones being replicated.
>> Thanks again.
>>
>> Renato M.
>>
>> 2011/4/27 Thejas M Nair<[email protected]>:
>>>
>>> The exception indicates that the hadoop job creation failed. Are you
>>> able to run simple MR queries using each of the inputs?
>>> It could also be caused by some problem Pig is having with copying
>>> the file being replicated to the distributed cache.
>>> -Thejas
>>>
>>>
>>> On 4/27/11 3:42 PM, "Renato Marroquín Mogrovejo"
>>> <[email protected]> wrote:
>>>
>>> Does anybody have any suggestions? Please???
>>> Thanks again.
>>>
>>> Renato M.
>>>
>>> 2011/4/26 Alan Gates<[email protected]>
>>>>
>>>> Sent for Renato, since Apache's mail system has decided it doesn't like
>>>> him.
>>>>
>>>> Alan.
>>>>
>>>> I am getting an error while trying to execute a simple fragment
>>>> replicated join on two files (one of 77MB and the other one of
>>>> 32MB). I am using the 32MB file as the small one to be replicated,
>>>> but I keep getting this error. Does anybody know how this count is
>>>> done? I mean, how does Pig determine that the small file is not
>>>> small enough, and how could I modify this?
>>>> I am executing these on four PCs with 3GB of RAM running Debian
>>>> Lenny. Thanks in advance.
>>>>
>>>>
>>>> Renato M.
>>>>
>>>> Pig Stack Trace
>>>> ---------------
>>>> ERROR 2017: Internal error creating job configuration.
>>>>
>>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2043:
>>>> Unexpected error during execution.
>>>>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:332)
>>>>     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
>>>>     at org.apache.pig.PigServer.execute(PigServer.java:828)
>>>>     at org.apache.pig.PigServer.access$100(PigServer.java:105)
>>>>     at org.apache.pig.PigServer$Graph.execute(PigServer.java:1080)
>>>>     at org.apache.pig.PigServer.executeBatch(PigServer.java:288)
>>>>     at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:109)
>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
>>>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>>>>     at org.apache.pig.Main.main(Main.java:391)
>>>> Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
>>>> ERROR 2017: Internal error creating job configuration.
>>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:624)
>>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
>>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.