Even though my execution plan says that only one MapReduce job will be
used, the web UI shows two MR jobs for the Pig task. I am probably
missing something in the middle, because replicated joins should only
use one MR job, right?
Another thing I find weird: when I tried executing the FR join again, I
got a JavaHeapSpace problem in its second job, whereas before I got an
error saying something like Pig was expecting X bytes but was getting
X+Y bytes. I haven't been able to reproduce that earlier error; it
probably had something to do with the state of my environment at the
time.
I thought the error about Pig expecting X bytes and getting more than
expected was related to Pig seeing roughly a 4x expansion when loading
data from disk into memory. That is why I was asking how this check is
done: is it something like (available Java heap space > 4 x file size)?
Thanks again.
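To make explicit what I mean by that check, here is a minimal sketch of the heuristic I am assuming; the function name and the 4x expansion factor are my own guesses, not Pig's actual implementation:

```python
def fits_in_memory(file_size_bytes, heap_bytes, expansion_factor=4):
    """Rough feasibility check for the replicated side of an FR join.

    Assumes the in-memory representation is about `expansion_factor`
    times the on-disk size (the ~4x expansion mentioned above).
    Illustrative only -- not taken from the Pig source.
    """
    return heap_bytes > expansion_factor * file_size_bytes

# A 77MB replicated input against a 200MB heap: 4 * 77MB = 308MB > 200MB,
# so the join would not be expected to fit.
print(fits_in_memory(77 * 1024**2, 200 * 1024**2))  # -> False
```

If something like this is what Pig does internally, it would explain why a file that looks small on disk can still blow the heap.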
#-----------------------------------------------
# Logical Plan:
#-----------------------------------------------
Store 1-86 Schema: {proy_sR::sr_cde_sk: int,proy_cD::cd_dem_sk: int}
Type: Unknown
|
|---LOJoin 1-25 Schema: {proy_sR::sr_cde_sk: int,proy_cD::cd_dem_sk:
int} Type: bag
| |
| Project 1-23 Projections: [0] Overloaded: false FieldSchema:
sr_cde_sk: int Type: int
| Input: ForEach 1-18
| |
| Project 1-24 Projections: [0] Overloaded: false FieldSchema:
cd_dem_sk: int Type: int
| Input: ForEach 1-22
|
|---ForEach 1-18 Schema: {sr_cde_sk: int} Type: bag
| | |
| | Project 1-17 Projections: [0] Overloaded: false
FieldSchema: sr_cde_sk: int Type: int
| | Input: ForEach 1-66
| |
| |---ForEach 1-66 Schema: {sr_cde_sk: int} Type: bag
| | |
| | Cast 1-35 FieldSchema: sr_cde_sk: int Type: int
| | |
| | |---Project 1-34 Projections: [0] Overloaded: false
FieldSchema: sr_cde_sk: bytearray Type: bytearray
| | Input: Load 1-13
| |
| |---Load 1-13 Schema: {sr_cde_sk: bytearray} Type: bag
|
|---ForEach 1-22 Schema: {cd_dem_sk: int} Type: bag
| |
| Project 1-21 Projections: [0] Overloaded: false
FieldSchema: cd_dem_sk: int Type: int
| Input: ForEach 1-85
|
|---ForEach 1-85 Schema: {cd_dem_sk: int} Type: bag
| |
|         Cast 1-68 FieldSchema: cd_dem_sk: int Type: int
| |
| |---Project 1-67 Projections: [0] Overloaded: false
FieldSchema: cd_dem_sk: bytearray Type: bytearray
| Input: Load 1-14
|
|---Load 1-14 Schema: {cd_dem_sk: bytearray} Type: bag
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-107
|
|---FRJoin[tuple] - 1-101
| |
| Project[int][0] - 1-99
| |
| Project[int][0] - 1-100
|
|---New For Each(false)[bag] - 1-92
| | |
| | Project[int][0] - 1-91
| |
| |---New For Each(false)[bag] - 1-90
| | |
| | Cast[int] - 1-89
| | |
| | |---Project[bytearray][0] - 1-88
| |
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/sr.dat:PigStorage('|'))
- 1-87
|
|---New For Each(false)[bag] - 1-98
| |
| Project[int][0] - 1-97
|
|---New For Each(false)[bag] - 1-96
| |
| Cast[int] - 1-95
| |
| |---Project[bytearray][0] - 1-94
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/cd.dat:PigStorage('|'))
- 1-93
2011-04-29 23:04:54,727 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 2
2011-04-29 23:04:54,727 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 2
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-109
Map Plan
Store(hdfs://berlin.labbio:54310/tmp/temp1815576246/tmp379673501:org.apache.pig.builtin.BinStorage)
- 1-110
|
|---New For Each(false)[bag] - 1-98
| |
| Project[int][0] - 1-97
|
|---New For Each(false)[bag] - 1-96
| |
| Cast[int] - 1-95
| |
| |---Project[bytearray][0] - 1-94
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/cd.dat:PigStorage('|'))
- 1-93--------
Global sort: false
----------------
MapReduce node 1-108
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-107
|
|---FRJoin[tuple] - 1-101
| |
| Project[int][0] - 1-99
| |
| Project[int][0] - 1-100
|
|---New For Each(false)[bag] - 1-92
| |
| Project[int][0] - 1-91
|
|---New For Each(false)[bag] - 1-90
| |
| Cast[int] - 1-89
| |
| |---Project[bytearray][0] - 1-88
|
|---Load(hdfs://berlin.labbio:54310/user/hadoop/pigData/sr.dat:PigStorage('|'))
- 1-87--------
Global sort: false
----------------
2011/4/28 Daniel Dai <[email protected]>:
> There should be only one job. Thanks to Thejas for pointing it out.
>
> Daniel
>
>
> -----Original Message----- From: Daniel Dai
> Sent: Wednesday, April 27, 2011 7:18 PM
> To: [email protected]
> Cc: Renato Marroquín Mogrovejo ; [email protected]
> Subject: Re: Error Executing a Fragment Replicated Join
>
> Do you see the failure in the first job (sampling) or in the second
> job? Do you see the exception right after the job kicks off?
>
> If the replicated side is too large, you will probably see a "Java heap
> exception" rather than a job setup exception. This looks more like an
> environment issue. Check whether you can run a regular join, and
> whether you have another hadoop config file in your classpath.
>
> Daniel
>
>
> On 04/27/2011 05:26 PM, Renato Marroquín Mogrovejo wrote:
>>
>> Now that the Apache server is ok with me again, I can write back to
>> the list. I wrote to the Apache Infra team and they told me to write
>> messages just in plain text, disabling any html within the message
>> (not that I ever sent html but oh well), I guess that worked :)
>> Well, first thanks for answering. I am using pig 0.7 and my pig script
>> is as follows:
>>
>> {code}
>> sr = LOAD 'pigData/sr.dat' using PigStorage('|') AS
>> (sr_ret_date_sk:int, sr_ret_tim_sk:int, sr_ite_sk:int, sr_cus_sk:int,
>> sr_cde_sk:int, sr_hde_sk:int, sr_add_sk:int, sr_sto_sk:int,
>> sr_rea_sk:int, sr_tic_num:int, sr_ret_qua:int, sr_ret_amt:double,
>> sr_ret_tax:double, sr_ret_amt_inc_tax:double, sr_fee:double,
>> sr_ret_sh_cst:double, sr_ref_csh:double, sr_rev_cha:double,
>> sr_sto_cred:double, sr_net_lss:double);
>>
>> cd = LOAD 'pigData/cd.dat' using PigStorage('|') AS (cd_dem_sk:int,
>> cd_gnd:chararray, cd_mrt_sts:chararray, cd_edt_sts:chararray,
>> cd_pur_est:int, cd_cred_rtg:chararray, cd_dep_cnt:int,
>> cd_dep_emp_cnt:int, cd_dep_col_count:int);
>>
>> proy_sR = FOREACH sr GENERATE sr_cde_sk;
>> proy_cD = FOREACH cd GENERATE cd_dem_sk;
>>
>> join_sR_cD = JOIN proy_sR BY sr_cde_sk, proy_cD BY cd_dem_sk USING
>> 'replicated';
>>
>> STORE join_sR_cD INTO 'queryResults/query.11.sr.cd.5.1' using
>> PigStorage('|');
>> {/code}
>>
>> "cd" is the 77MB relation and "sr" the 32MB one. I had some other
>> similar queries in which the 32MB relation was being joined with
>> smaller relations (<10MB), giving the same problem. I modified those
>> so that the relations under 10MB would be the ones being replicated.
>> Thanks again.
>>
>> Renato M.
>>
>> 2011/4/27 Thejas M Nair<[email protected]>:
>>>
>>> The exception indicates that the hadoop job creation failed. Are you
>>> able to run simple MR queries using each of the inputs?
>>> It could also be caused by some problem Pig is having with copying
>>> the file being replicated to the distributed cache.
>>> -Thejas
>>>
>>>
>>> On 4/27/11 3:42 PM, "Renato Marroquín Mogrovejo"
>>> <[email protected]> wrote:
>>>
>>> Does anybody have any suggestions? Please???
>>> Thanks again.
>>>
>>> Renato M.
>>>
>>> 2011/4/26 Alan Gates<[email protected]>
>>>>
>>>> Sent for Renato, since Apache's mail system has decided it doesn't like
>>>> him.
>>>>
>>>> Alan.
>>>>
>>>> I am getting an error while trying to execute a simple fragment
>>>> replicated join on two files (one of 77MB and the other one of
>>>> 32MB). I am using the 32MB file as the small one to be replicated,
>>>> but I keep getting this error. Does anybody know how this count is
>>>> done? I mean, how does Pig determine that the small file is not
>>>> small enough, and how could I modify this?
>>>> I am executing these on four PCs with 3GB of RAM running Debian
>>>> Lenny. Thanks in advance.
>>>>
>>>>
>>>> Renato M.
>>>>
>>>> Pig Stack Trace
>>>> ---------------
>>>> ERROR 2017: Internal error creating job configuration.
>>>>
>>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2043:
>>>> Unexpected error during execution.
>>>>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:332)
>>>>     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
>>>>     at org.apache.pig.PigServer.execute(PigServer.java:828)
>>>>     at org.apache.pig.PigServer.access$100(PigServer.java:105)
>>>>     at org.apache.pig.PigServer$Graph.execute(PigServer.java:1080)
>>>>     at org.apache.pig.PigServer.executeBatch(PigServer.java:288)
>>>>     at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:109)
>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
>>>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>>>>     at org.apache.pig.Main.main(Main.java:391)
>>>> Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
>>>> ERROR 2017: Internal error creating job configuration.
>>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:624)
>>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
>>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.