Did you set heap size to 0?

Sent from my iPhone
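For reference, the setting being asked about lives in hadoop-env.sh; a sketch of a sane, non-zero configuration (the 2000 MB value is illustrative only, not taken from this thread):

```shell
# hadoop-env.sh (Hadoop 1.x) -- HADOOP_HEAPSIZE sets the heap, in MB,
# for the JVMs launched by the Hadoop start-up scripts; left unset, it
# defaults to 1000 MB. A value of 0 (or "00") leaves those JVMs with
# essentially no heap at all.
export HADOOP_HEAPSIZE=2000   # illustrative value, not from the thread
```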
On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger" <[email protected]> wrote:

> Hi,
>
> On 28.03.12 18:28, Jonathan Coveney wrote:
>> - dev@pig
>> + user@pig
>
> You are right, this fits better on user@pig.
>
>> What command are you using to run this? Are you upping the max heap?
>
> I created a Pig script wiki.pig with the following content:
>
> ===
> register piggybank.jar;
>
> pages = load
> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
> using org.apache.pig.piggybank.storage.XMLLoader('page') as
> (page:chararray);
> pages = limit pages 1;
> dump pages;
> ===
>
> and used the command
>
> % pig wiki.pig
>
> to run the Pig script.
>
> I use the current Hadoop 1.0.1. My version of Pig is checked out from trunk
> and built by myself.
>
> The only thing I customized was setting HADOOP_HEAPSIZE 00 in
> hadoop-env.sh (the default heap size was 1000 MB).
>
> Kind regards,
> Herbert
>
>> 2012/3/28 Herbert Mühlburger <[email protected]>
>>
>>> Hi,
>>>
>>> I would like to use Pig to work with Wikipedia dump files. It works
>>> fine with an input file of around 8 GB whose XML element contents are
>>> not too big.
>>>
>>> Currently I would like to use the file
>>> "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
>>> (around 2 GB compressed), which can be found here:
>>>
>>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>>>
>>> Is it possible that, because the content of a <page></page> XML
>>> element can become very large (several GB, for instance), the
>>> piggybank XMLLoader has problems loading elements split on <page>?
>>>
>>> I hope somebody can help me with this.
>>>
>>> I've tried to call the following Pig Latin script:
>>>
>>> ========
>>> register piggybank.jar;
>>>
>>> pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
>>>     using org.apache.pig.piggybank.storage.XMLLoader('page')
>>>     as (page:chararray);
>>> pages = limit pages 1;
>>> dump pages;
>>> ========
>>>
>>> and always get the following error (the generated logfile is attached):
>>>
>>> ========
>>> 2012-03-28 14:49:54,695 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
>>> 2012-03-28 14:49:54,696 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
>>> 2012-03-28 14:49:54,936 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/herbert/.pigbootup not found
>>> 2012-03-28 14:49:55,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
>>> 2012-03-28 14:49:55,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
>>> 2012-03-28 14:49:55,845 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
>>> 2012-03-28 14:49:56,021 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>>> 2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>> 2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>> 2012-03-28 14:49:56,171 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
>>> 2012-03-28 14:49:56,187 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>>> 2012-03-28 14:49:56,274 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5733074907123320640.jar
>>> 2012-03-28 14:49:59,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5733074907123320640.jar created
>>> 2012-03-28 14:49:59,736 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>>> 2012-03-28 14:49:59,795 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>>> hdfs://localhost:9000/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>>> 2012-03-28 14:50:00,152 [Thread-11] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>> 2012-03-28 14:50:00,169 [Thread-11] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 35
>>> 2012-03-28 14:50:00,299 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>>> 2012-03-28 14:50:01,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201203281105_0009
>>> 2012-03-28 14:50:01,278 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201203281105_0009
>>> 2012-03-28 14:50:23,145 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1% complete
>>> 2012-03-28 14:50:29,206 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 2% complete
>>> 2012-03-28 14:50:38,288 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete
>>> 2012-03-28 14:53:17,686 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 7% complete
>>> 2012-03-28 14:53:41,529 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 9% complete
>>> 2012-03-28 14:55:05,775 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
>>> 2012-03-28 14:55:32,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
>>> 2012-03-28 14:56:21,754 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 13% complete
>>> 2012-03-28 14:58:36,797 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203281105_0009 has failed! Stop running all dependent jobs
>>> 2012-03-28 14:58:36,799 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>>> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
>>> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>>> 2012-03-28 14:58:36,854 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>>>
>>> HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
>>> 1.0.1	0.11.0-SNAPSHOT	herbert	2012-03-28 14:49:56	2012-03-28 14:58:36	LIMIT
>>>
>>> Failed!
>>>
>>> Failed Jobs:
>>> JobId	Alias	Feature	Message	Outputs
>>> job_201203281105_0009	pages		Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201203281105_0009_m_000003	hdfs://localhost:9000/tmp/temp1813558187/tmp250990633,
>>>
>>> Input(s):
>>> Failed to read data from "/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
>>>
>>> Output(s):
>>> Failed to produce result in "hdfs://localhost:9000/tmp/temp1813558187/tmp250990633"
>>>
>>> Counters:
>>> Total records written : 0
>>> Total bytes written : 0
>>> Spillable Memory Manager spill count : 0
>>> Total bags proactively spilled: 0
>>> Total records proactively spilled: 0
>>>
>>> Job DAG:
>>> job_201203281105_0009
>>>
>>> 2012-03-28 14:58:36,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
>>> 2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
>>> Details at logfile: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
>>> pig wiki.pig 8,48s user 2,72s system 2% cpu 8:46,07 total
>>> ========
>>>
>>> Thank you very much and kind regards,
>>> Herbert
>>>
>>
>
> --
> ================================================================
> Herbert Muehlburger
> Software Development and Business Management
> Graz University of Technology
> www.muehlburger.at www.twitter.com/hmuehlburger
> ================================================================
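The question about very large <page> elements comes down to whether the loader materializes a whole element as one record. A minimal Python sketch of that tag-buffering strategy (an illustration of the general technique, not the actual piggybank Java source):

```python
# Simplified sketch of what a tag-based loader in the style of piggybank's
# XMLLoader does conceptually: scan the stream for <page>, buffer every
# line until the matching </page>, and emit the whole element as ONE
# record. Memory use is therefore proportional to the largest element,
# so a multi-GB <page> needs multi-GB of task heap regardless of how
# the input file is split.

def records(stream, tag="page"):
    start, end = "<%s>" % tag, "</%s>" % tag
    buf, inside = [], False
    for line in stream:
        if not inside and start in line:
            inside = True
            line = line[line.index(start):]
        if inside:
            buf.append(line)          # the whole element is held in memory
            if end in line:
                inside = False
                yield "".join(buf)
                buf = []

xml = ("<doc>\n<page>\n<title>A</title>\n</page>\n"
       "<page>\n<title>B</title>\n</page>\n</doc>\n")
pages = list(records(iter(xml.splitlines(True))))
print(len(pages))  # 2
```

If this mirrors the real loader's behavior, then one element larger than the task heap will always fail; note also that in Hadoop 1.x HADOOP_HEAPSIZE only sizes the daemon and client JVMs, while the map tasks that actually run the loader take their heap from mapred.child.java.opts (the -Xmx setting).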
