PigStorage and ElephantBird's JsonLoader - InputFormat

Jonathan Holloway Wed, 15 Jun 2011 18:58:51 -0700

Hi all,

I was wondering whether somebody could explain how Pig deals with nested
directories of log files, something like:


/logs/2011-01-01/a.log
/logs/2011-01-01/b.log
/logs/2011-01-01/c.log

I'm pretty sure if I give a Pig script the /logs directory as input it will
successfully process all logs (a.log, b.log, c.log)
within that structure.

However, I'm seeing a discrepancy with JsonLoader in Elephant Bird: if I do
the same thing, it fails with the following:

Backend error message
---------------------
java.io.IOException: Cannot open filename /logs/2011-01-01
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:356)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:176)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Cannot open filename /logs/2011-01-01

java.io.IOException: Cannot open filename /logs/2011-01-01
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:356)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:176)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
================================================================================
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain],
exit code [2]

I think JsonLoader currently returns a TextInputFormat, whereas PigStorage
can handle this because it returns a PigTextInputFormat, which uses the
MapRedUtil.getAllFileRecursively() workaround for MAPREDUCE-1577.
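
To make that suspicion concrete, here is a minimal sketch of the difference
I have in mind. It is plain Java against the local filesystem, not the HDFS,
Hadoop or Pig APIs, and the class and method names (InputListingSketch,
listFlat, listRecursively) are made up for illustration. A one-level listing
hands back the date directory as if it were an input file, which would match
the "Cannot open filename /logs/2011-01-01" error, while a recursive listing
expands it into the individual log files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: local filesystem via java.nio, not the Hadoop FileSystem API.
// All names here are hypothetical and do not exist in Pig or Elephant Bird.
class InputListingSketch {

    // One-level listing: a subdirectory is returned as if it were an input file.
    static List<Path> listFlat(Path inputDir) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }
        Collections.sort(entries); // deterministic order within one directory
        return entries;
    }

    // Recursive listing: directories are expanded until only files remain.
    static List<Path> listRecursively(Path inputDir) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path entry : listFlat(inputDir)) {
            if (Files.isDirectory(entry)) {
                files.addAll(listRecursively(entry));
            } else {
                files.add(entry);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the layout from this mail under a temporary directory.
        Path logs = Files.createTempDirectory("logs");
        Path day = Files.createDirectory(logs.resolve("2011-01-01"));
        for (String name : new String[] {"a.log", "b.log", "c.log"}) {
            Files.createFile(day.resolve(name));
        }
        System.out.println("flat:      " + listFlat(logs));
        System.out.println("recursive: " + listRecursively(logs));
    }
}
```

If my reading is right, the flat case corresponds to what JsonLoader's input
format does today and the recursive case to what PigTextInputFormat does.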

Can anybody confirm this is actually the case, and whether there's some sort
of workaround for it?

I'm using Pig 0.8.0, Apache Hadoop 0.20.2 and Oozie 3.0.0.

Many thanks in advance,
Jon.
