Difference in Semantics between Load statement in Pig and HDFS client on 
Command line
-------------------------------------------------------------------------------------

                 Key: PIG-1576
                 URL: https://issues.apache.org/jira/browse/PIG-1576
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.7.0, 0.6.0
            Reporter: Viraj Bhat


Here is my directory structure on HDFS which I want to access using Pig. 
This is a sample, but in real use case I have more than 100 of these 
directories.
{code}
$ hadoop fs -ls /user/viraj/recursive/
Found 3 items
drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 
/user/viraj/recursive/20080615
drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 
/user/viraj/recursive/20080616
drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 
/user/viraj/recursive/20080617
{code}
Using the command line I am access them using variety of options:
{code}
$ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
-rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 
/user/viraj/recursive/20080615/kv2.txt
-rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 
/user/viraj/recursive/20080616/kv2.txt
-rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 
/user/viraj/recursive/20080617/kv2.txt

$ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/

-rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 
/user/viraj/recursive/20080615/kv2.txt

-rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 
/user/viraj/recursive/20080616/kv2.txt

-rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 
/user/viraj/recursive/20080617/kv2.txt
{code}

I have written a Pig script, all the below combination of load statements do 
not work?
{code}
--A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') 
as (k:int, v:chararray);
A = load '/user/viraj/recursive/{20080615..20080617}/' using 
PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

I get the following error in Pig 0.8
{noformat}
2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil 
- 1 map reduce job(s) failed!
2010-08-27 16:34:27,711 [main] INFO  org.apache.pig.tools.pigstats.PigStats - 
Script Statistics: 
HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
0.20.2  0.8.0-SNAPSHOT  viraj   2010-08-27 16:34:24     2010-08-27 16:34:27     
LIMIT
Failed!
Failed Jobs:
JobId   Alias   Feature Message Outputs
N/A     A,AL            Message: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for: /user/viraj/recursive/{20080615..20080617}/
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at 
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 
0 files
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
        ... 7 more
        hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
{noformat}

The following works:
{code}
A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') 
as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

Why is there an inconsistency between HDFS client and Pig?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to