Any ideas on how I can figure out where this is going wrong, or is this
expected behavior?
New observations:

1. It seems path globbing does not work with HAR files in Pig. Is this
intentional? For example:
hadoop fs -ls har:///x/y/z/{a.har,b.har}/*
works and lists all the files in both har files. If I give the same path as
the input path to a Pig script, it does not work.

2. Wildcards in the HAR path.
    As in the example above, the following works with hadoop fs:
hadoop fs -ls har://x/y/*/a.har/*
This lists all files from all folders that contain a.har.

If I give the same path as the input path to Pig, it does not work. I have
tried both of these on Pig 0.8.
Also, for the second use case, if I remove the last wildcard (where the
files would be), it works.
For example, with this input path to Pig:
har://x/y/*/a.har/logFile

Pig can read the file and give me records back, but a wildcard in the last
position does not work.
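
On point 1, the stack trace quoted further down shows the braces reaching
HarFileSystem as %7B and %7D. As a rough illustration of what that encoding
does to the path, here is a small Python sketch. It uses Python's
urllib.parse.quote as a stand-in; that it matches what Pig or Hadoop does
internally is an assumption, and the encode_path name is made up for this
example.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode a path, leaving '/', ',' and '*' untouched.

    Hypothetical helper for illustration only; this is not Pig or Hadoop code.
    """
    return quote(path, safe="/,*")


if __name__ == "__main__":
    raw = "/a/b/c/{d.har,e.har}/z/ab/*"
    # The braces become %7B and %7D; the comma and the wildcard are kept.
    print(encode_path(raw))
```

Note that the brace group stays inside a single path segment after encoding,
so the two archive names are never separated unless something expands the
glob first.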

Any insight into whether this should or should not work would be great. I
have 30000 files in one folder inside the har, so I cannot list each one. I
want to use a wildcard as the last element of the path and use path globbing
to specify multiple har files.
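
One possible workaround, going by the observation quoted below that a
comma-separated list of har paths does load: expand the braces on the client
side and hand Pig the resulting list. The sketch below is Python, handles only
a single, non-nested {...} group, and expand_braces / to_pig_location are
hypothetical helper names, not part of Pig or Hadoop. Whether the trailing *
then behaves inside each archive would still need testing.

```python
import re
from typing import List


def expand_braces(path: str) -> List[str]:
    """Expand the first {a,b,...} group in path into one path per alternative.

    Hypothetical helper; handles a single, non-nested brace group only.
    """
    match = re.search(r"\{([^{}]*)\}", path)
    if match is None:
        # No brace group: return the path unchanged.
        return [path]
    head, tail = path[:match.start()], path[match.end():]
    return [head + alt + tail for alt in match.group(1).split(",")]


def to_pig_location(path: str) -> str:
    """Join the expanded paths with commas, as in a multi-path LOAD string."""
    return ",".join(expand_braces(path))


if __name__ == "__main__":
    print(to_pig_location("har:///a/b/c/{d.har,e.har}/z/ab/*"))
```

The resulting string can be pasted into the LOAD statement in place of the
brace pattern.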


thanks
mohnish

On Wed, Sep 26, 2012 at 10:44 AM, Mohnish Kodnani <[email protected]
> wrote:

> I think it's Pig-related, because if I do hadoop fs -ls on the har file path
> with input globbing it works fine.
>
>
> On Tue, Sep 25, 2012 at 7:45 PM, Cheolsoo Park <[email protected]>wrote:
>
>> Sounds like I was wrong. ;-)
>>
>> You might get a better answer from the Hadoop user group, since I think
>> this is more related to HarFileSystem than Pig.
>>
>> Thanks,
>> Cheolsoo
>>
>> On Tue, Sep 25, 2012 at 6:20 PM, Mohnish Kodnani
>> <[email protected]>wrote:
>>
>> > Hi Cheolsoo,
>> > thanks for replying. On the same system the following works:
>> >
>> > x = load 'har:///a/b/b/22.har/00/*,har:///a/b/c/d/23.har/00/*' using
>> > PigStorage('\t');
>> >
>> > Two separate file paths with har protocol work.
>> >
>> > A single path works, but if I do the following I get an error.
>> > x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>> >
>> > Thanks
>> > Mohnish
>> >
>> > On Tue, Sep 25, 2012 at 6:09 PM, Cheolsoo Park <[email protected]
>> > >wrote:
>> >
>> > > Hi Mohnish,
>> > >
>> > > I am not very familiar with har files, so I might be wrong here.
>> > >
>> > > Looking at the call stack, the exception is thrown from initialize(URI
>> > > name, Configuration conf) in HarFileSystem.java. In the source code, the
>> > > comment of this method says the following:
>> > >
>> > > > Initialize a Har filesystem per har archive. The
>> > > > archive home directory is the top level directory
>> > > > in the filesystem that contains the HAR archive.
>> > >
>> > >
>> > > This sounds to me like HarFileSystem expects a single path.
>> > >
>> > >
>> > > > This gives error due to the curly braces being encoded to %7B and %7D.
>> > >
>> > >
>> > > The encoded curly braces should be fine though. In fact, if they're not
>> > > encoded, that's a problem because then a URISyntaxException will be thrown
>> > > by Java URI class.
>> > >
>> > > Hope that this helps,
>> > > Cheolsoo
>> > >
>> > >
>> > > On Tue, Sep 25, 2012 at 12:43 PM, Mohnish Kodnani <
>> > > [email protected]
>> > > > wrote:
>> > >
>> > > > Hi,
>> > > > I am trying to give multiple paths to a pig script using path globbing in
>> > > > HAR file format and it does not seem to work. I wanted to know if this is
>> > > > expected or a bug / feature request.
>> > > >
>> > > > Command :
>> > > > x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>> > > >
>> > > > This gives an error because the curly braces are encoded to %7B and %7D.
>> > > > I am trying this on Pig 0.8.0.
>> > > >
>> > > > ERROR 2017: Internal error creating job configuration.
>> > > >
>> > > > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias blah
>> > > >         at org.apache.pig.PigServer.openIterator(PigServer.java:765)
>> > > >         at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
>> > > >         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
>> > > >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>> > > >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>> > > >         at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
>> > > >         at org.apache.pig.Main.run(Main.java:455)
>> > > >         at org.apache.pig.Main.main(Main.java:107)
>> > > > Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias blah
>> > > >         at org.apache.pig.PigServer.storeEx(PigServer.java:889)
>> > > >         at org.apache.pig.PigServer.store(PigServer.java:827)
>> > > >         at org.apache.pig.PigServer.openIterator(PigServer.java:739)
>> > > >         ... 7 more
>> > > > Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
>> > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:679)
>> > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
>> > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
>> > > >         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
>> > > >         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
>> > > >         at org.apache.pig.PigServer.storeEx(PigServer.java:885)
>> > > >         ... 9 more
>> > > > Caused by: java.io.IOException: Invalid path for the Har Filesystem. har:///user/cronusapp/cassini_downsample_logs/prod/2012/09/%7B22.har,23.har%7D/00/*
>> > > >         at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:100)
>> > > >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1563)
>> > > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
>> > > >         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
>> > > >         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:348)
>> > > >         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:317)
>> > > >         at org.apache.pig.builtin.PigStorage.setLocation(PigStorage.java:219)
>> > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:369)
>> > > >         ... 14 more
>> > > >
>> > >
>> >
>>
>
>
