- dev@pig
+ user@pig

What command are you using to run this? Are you upping the max heap?
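For example (just a sketch — the property names are the standard knobs in Pig/Hadoop 1.x, but the values here are made-up examples, not recommendations):

```shell
# Two separate heaps matter here; raising only one of them may not help.

# 1) Client-side heap for the Pig front end (value in MB; example only).
export PIG_HEAPSIZE=4096

# 2) Heap of the Hadoop map tasks that actually run XMLLoader -- the
#    "Java heap space" error in the quoted log came from a failed map
#    task, so this is the one most likely to need raising (example value):
#    pig -Dmapred.child.java.opts=-Xmx2048m wiki.pig

echo "client heap (MB): ${PIG_HEAPSIZE}"
```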

2012/3/28 Herbert Mühlburger <[email protected]>

> Hi,
>
> I would like to use Pig to work with Wikipedia dump files. It works
> successfully with an input file of around 8 GB, as long as the content of
> the individual XML elements is not too big.
>
> In my current case I would like to use the file
> "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2" (around
> 2 GB compressed), which can be found here:
>
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>
> Is it possible that Piggybank's XMLLoader has problems loading elements
> split on <page>, given that the content of a single <page></page> element
> can potentially become very large (several GB, for instance)?
>
> I hope somebody can help me with this.
>
> I've tried to run the following Pig Latin script:
>
> =========
> register piggybank.jar;
>
> pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
>     using org.apache.pig.piggybank.storage.XMLLoader('page')
>     as (page:chararray);
> pages = limit pages 1;
> dump pages;
> =========
>
> and always get the following error (the generated logfile is attached):
>
> =========
>
> 2012-03-28 14:49:54,695 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
> 2012-03-28 14:49:54,696 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
> 2012-03-28 14:49:54,936 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /Users/herbert/.pigbootup not found
> 2012-03-28 14:49:55,189 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
> 2012-03-28 14:49:55,403 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
> 2012-03-28 14:49:55,845 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
> 2012-03-28 14:49:56,021 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2012-03-28 14:49:56,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2012-03-28 14:49:56,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2012-03-28 14:49:56,171 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-03-28 14:49:56,187 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-03-28 14:49:56,274 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5733074907123320640.jar
> 2012-03-28 14:49:59,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5733074907123320640.jar created
> 2012-03-28 14:49:59,736 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-03-28 14:49:59,795 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> hdfs://localhost:9000/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
> 2012-03-28 14:50:00,152 [Thread-11] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2012-03-28 14:50:00,169 [Thread-11] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 35
> 2012-03-28 14:50:00,299 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2012-03-28 14:50:01,277 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201203281105_0009
> 2012-03-28 14:50:01,278 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201203281105_0009
> 2012-03-28 14:50:23,145 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1% complete
> 2012-03-28 14:50:29,206 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 2% complete
> 2012-03-28 14:50:38,288 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete
> 2012-03-28 14:53:17,686 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 7% complete
> 2012-03-28 14:53:41,529 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 9% complete
> 2012-03-28 14:55:05,775 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
> 2012-03-28 14:55:32,685 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
> 2012-03-28 14:56:21,754 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 13% complete
> 2012-03-28 14:58:36,797 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203281105_0009 has failed! Stop running all dependent jobs
> 2012-03-28 14:58:36,799 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
> 2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2012-03-28 14:58:36,854 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion  PigVersion       UserId   StartedAt            FinishedAt           Features
> 1.0.1          0.11.0-SNAPSHOT  herbert  2012-03-28 14:49:56  2012-03-28 14:58:36  LIMIT
>
> Failed!
>
> Failed Jobs:
> JobId   Alias   Feature Message Outputs
> job_201203281105_0009   pages           Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201203281105_0009_m_000003        hdfs://localhost:9000/tmp/temp1813558187/tmp250990633,
>
> Input(s):
> Failed to read data from "/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
>
> Output(s):
> Failed to produce result in "hdfs://localhost:9000/tmp/temp1813558187/tmp250990633"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201203281105_0009
>
>
> 2012-03-28 14:58:36,855 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
> Details at logfile: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
> pig wiki.pig  8,48s user 2,72s system 2% cpu 8:46,07 total
>
> =========
>
> Thank you very much and kind regards,
> Herbert
>
