- dev@pig + user@pig

What command are you using to run this? Are you upping the max heap?
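If the OOM is in the map tasks (the job report below points at a failed map task), the heap that matters is the task JVM's, not just the Pig client's. As a minimal sketch, assuming Hadoop 1.x property names and that a 2 GB task heap actually fits on your box (the -Xmx value is only an example), you could raise it from inside the script:

=========
-- raise the heap of the child task JVMs before the job is submitted
set mapred.child.java.opts '-Xmx2048m';

register piggybank.jar;

pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
    using org.apache.pig.piggybank.storage.XMLLoader('page')
    as (page:chararray);
pages = limit pages 1;
dump pages;
=========

The same property can also be set cluster-wide in mapred-site.xml. Note that with this load statement each <page> element becomes a single chararray record, so if one page's revision history really runs to several GB it probably won't fit in a mapper's heap no matter what -Xmx you pick.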
2012/3/28 Herbert Mühlburger <[email protected]>

> Hi,
>
> I would like to use Pig to work with Wikipedia dump files. It works fine
> with an input file of around 8 GB whose XML element content is not too big.
>
> In my current case I would like to use the file
> "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
> (around 2 GB compressed), which can be found here:
>
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>
> Is it possible that, because the content of a <page></page> XML element can
> become very large (several GB, for instance), the Piggybank XMLLoader has
> problems loading elements split on <page>?
>
> I hope somebody can help me with this.
>
> I've tried to run the following Pig Latin script:
>
> =========
> register piggybank.jar;
>
> pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
>     using org.apache.pig.piggybank.storage.XMLLoader('page')
>     as (page:chararray);
> pages = limit pages 1;
> dump pages;
> =========
>
> and always get the following error (the generated logfile is attached):
>
> =========
> 2012-03-28 14:49:54,695 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
> 2012-03-28 14:49:54,696 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
> 2012-03-28 14:49:54,936 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/herbert/.pigbootup not found
> 2012-03-28 14:49:55,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
> 2012-03-28 14:49:55,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
> 2012-03-28 14:49:55,845 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
> 2012-03-28 14:49:56,021 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2012-03-28 14:49:56,171 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-03-28 14:49:56,187 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-03-28 14:49:56,274 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5733074907123320640.jar
> 2012-03-28 14:49:59,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5733074907123320640.jar created
> 2012-03-28 14:49:59,736 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-03-28 14:49:59,795 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> hdfs://localhost:9000/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
> 2012-03-28 14:50:00,152 [Thread-11] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2012-03-28 14:50:00,169 [Thread-11] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 35
> 2012-03-28 14:50:00,299 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2012-03-28 14:50:01,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201203281105_0009
> 2012-03-28 14:50:01,278 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201203281105_0009
> 2012-03-28 14:50:23,145 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1% complete
> 2012-03-28 14:50:29,206 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 2% complete
> 2012-03-28 14:50:38,288 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete
> 2012-03-28 14:53:17,686 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 7% complete
> 2012-03-28 14:53:41,529 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 9% complete
> 2012-03-28 14:55:05,775 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
> 2012-03-28 14:55:32,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
> 2012-03-28 14:56:21,754 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 13% complete
> 2012-03-28 14:58:36,797 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203281105_0009 has failed! Stop running all dependent jobs
> 2012-03-28 14:58:36,799 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2012-03-28 14:58:36,854 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion  PigVersion       UserId   StartedAt            FinishedAt           Features
> 1.0.1          0.11.0-SNAPSHOT  herbert  2012-03-28 14:49:56  2012-03-28 14:58:36  LIMIT
>
> Failed!
>
> Failed Jobs:
> JobId                  Alias  Feature  Message  Outputs
> job_201203281105_0009  pages           Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201203281105_0009_m_000003  hdfs://localhost:9000/tmp/temp1813558187/tmp250990633,
>
> Input(s):
> Failed to read data from "/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
>
> Output(s):
> Failed to produce result in "hdfs://localhost:9000/tmp/temp1813558187/tmp250990633"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201203281105_0009
>
> 2012-03-28 14:58:36,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
> Details at logfile: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
> pig wiki.pig  8,48s user 2,72s system 2% cpu 8:46,07 total
> =========
>
> Thank you very much and kind regards,
> Herbert
