It would seem that when there is a wildcard in the last location of the file path, and when using the har file protocol, the combined path count is 0. I get this when trying the example given below.

2012-09-27 09:22:28,074 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-09-27 09:22:28,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-09-27 09:22:28,147 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: x: Store(hdfs://nn/tmp/temp1300843291/tmp-1282091819:org.apache.pig.impl.io.InterStorage) - scope-2 Operator Key: scope-2)
2012-09-27 09:22:28,155 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-09-27 09:22:28,268 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-09-27 09:22:28,280 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-09-27 09:22:30,055 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-09-27 09:22:30,096 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-09-27 09:22:30,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-09-27 09:22:46,428 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *21667*
2012-09-27 09:22:46,431 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *21667*
2012-09-27 09:22:46,440 [Thread-6] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2012-09-27 09:22:46,443 [Thread-6] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 335fea4fecb385745e9a6f2de174a5b26fbc6cae]
2012-09-27 09:24:04,257 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - *Total input paths (combined) to process : 0*
It seems that MapRedUtil returns 0 paths to process after it tries to find proper splits.

On Thu, Sep 27, 2012 at 9:30 AM, Mohnish Kodnani <[email protected]> wrote:

> Any ideas on how I can figure out where things are not working, or is this
> expected behavior?
> New observations:
>
> 1. It seems path globbing does not work with HAR files in Pig. Is this
> intentional? For example:
> hadoop fs -ls har:///x/y/z/{a.har,b.har}/* works and lists all the files
> in both har files. If I give the same path as the input path to a pig
> script, it does not seem to work.
>
> 2. Wildcards in a HAR path.
> As in the above example, the following works on hadoop fs:
> hadoop fs -ls har://x/y/*/a.har/*
> This lists all files from all folders that have a.har.
>
> If I give the same path as the input path to pig, it does not work. I have
> tried these 2 things on pig 0.8.
> Also, for the second use case: if I remove the last wildcard, where the
> files should be, then it works.
> For example, with the input path to pig:
> har://x/y/*/a.har/logFile
>
> pig can read the file and give me records back, but a wildcard in the
> last location does not work.
>
> Any insights would be great on whether this should or should not work. I
> have 30000 files in one folder inside the har; I cannot list each one, and
> I want to use a wildcard as the last element in the path and use path
> globbing to provide multiple har files.
>
> thanks
> mohnish
>
> On Wed, Sep 26, 2012 at 10:44 AM, Mohnish Kodnani <
> [email protected]> wrote:
>
>> I think it's pig related, because if I do hadoop fs -ls on the har file
>> path with input globbing, it works fine.
>>
>> On Tue, Sep 25, 2012 at 7:45 PM, Cheolsoo Park <[email protected]> wrote:
>>
>>> Sounds like I was wrong. ;-)
>>>
>>> You might get a better answer from the hadoop user group, since this is
>>> more related to HarFileSystem than to Pig, I think.
>>>
>>> Thanks,
>>> Cheolsoo
>>>
>>> On Tue, Sep 25, 2012 at 6:20 PM, Mohnish Kodnani
>>> <[email protected]> wrote:
>>>
>>> > Hi Cheolsoo,
>>> > thanks for replying. On the same system the following works:
>>> >
>>> > x = load 'har:///a/b/b/22.har/00/*,har:///a/b/c/d/23.har/00/*' using
>>> > PigStorage('\t');
>>> >
>>> > Two separate file paths with the har protocol work.
>>> >
>>> > A single path works, but if I do the following I get an error:
>>> > x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>>> >
>>> > Thanks
>>> > Mohnish
>>> >
>>> > On Tue, Sep 25, 2012 at 6:09 PM, Cheolsoo Park <[email protected]> wrote:
>>> >
>>> > > Hi Mohnish,
>>> > >
>>> > > I am not very familiar with har files, so I might be wrong here.
>>> > >
>>> > > Looking at the call stack, the exception is thrown from
>>> > > initialize(URI name, Configuration conf) in HarFileSystem.java. In
>>> > > the source code, the comment of this method says the following:
>>> > >
>>> > > > Initialize a Har filesystem per har archive. The
>>> > > > archive home directory is the top level directory
>>> > > > in the filesystem that contains the HAR archive.
>>> > >
>>> > > This sounds to me like HarFileSystem expects a single path.
>>> > >
>>> > > > This gives error due to the curly braces being encoded to %7B and
>>> > > > %7D.
>>> > >
>>> > > The encoded curly braces should be fine, though. In fact, if they're
>>> > > not encoded, that's a problem, because then a URISyntaxException will
>>> > > be thrown by the Java URI class.
>>> > >
>>> > > Hope that this helps,
>>> > > Cheolsoo
>>> > >
>>> > > On Tue, Sep 25, 2012 at 12:43 PM, Mohnish Kodnani <
>>> > > [email protected]> wrote:
>>> > >
>>> > > > Hi,
>>> > > > I am trying to give multiple paths to a pig script using path
>>> > > > globbing with the HAR file format, and it does not seem to work. I
>>> > > > wanted to know if this is expected, or a bug / feature request.
>>> > > >
>>> > > > Command:
>>> > > > x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>>> > > >
>>> > > > This gives an error due to the curly braces being encoded to %7B
>>> > > > and %7D. I am trying this on Pig 0.8.0.
>>> > > >
>>> > > > ERROR 2017: Internal error creating job configuration.
>>> > > >
>>> > > > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias blah
>>> > > >     at org.apache.pig.PigServer.openIterator(PigServer.java:765)
>>> > > >     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
>>> > > >     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
>>> > > >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>>> > > >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>>> > > >     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
>>> > > >     at org.apache.pig.Main.run(Main.java:455)
>>> > > >     at org.apache.pig.Main.main(Main.java:107)
>>> > > > Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias blah
>>> > > >     at org.apache.pig.PigServer.storeEx(PigServer.java:889)
>>> > > >     at org.apache.pig.PigServer.store(PigServer.java:827)
>>> > > >     at org.apache.pig.PigServer.openIterator(PigServer.java:739)
>>> > > >     ... 7 more
>>> > > > Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:679)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
>>> > > >     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
>>> > > >     at org.apache.pig.PigServer.storeEx(PigServer.java:885)
>>> > > >     ... 9 more
>>> > > > Caused by: java.io.IOException: Invalid path for the Har Filesystem.
>>> > > > har:///user/cronusapp/cassini_downsample_logs/prod/2012/09/%7B22.har,23.har%7D/00/*
>>> > > >     at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:100)
>>> > > >     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1563)
>>> > > >     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
>>> > > >     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
>>> > > >     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:348)
>>> > > >     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:317)
>>> > > >     at org.apache.pig.builtin.PigStorage.setLocation(PigStorage.java:219)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:369)
>>> > > >     ... 14 more
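[Editor's note] Cheolsoo's point about the curly braces can be checked with plain java.net.URI, independent of Hadoop and Pig. A minimal sketch (class and method names are made up for illustration): the single-string URI constructor rejects a raw `{` in the path, while the multi-argument constructor percent-encodes it, which is how `{22.har,23.har}` becomes `%7B22.har,23.har%7D` by the time HarFileSystem.initialize sees and rejects the path.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HarGlobUriDemo {

    // Raw curly braces are illegal in a URI path, so the single-string
    // constructor throws URISyntaxException on an unencoded glob.
    static boolean isParseableUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // The multi-argument constructor quotes illegal characters instead of
    // rejecting them: '{' becomes %7B and '}' becomes %7D, while ',' and
    // '*' are legal path characters and pass through unchanged.
    static String encodedPath(String rawPath) {
        try {
            return new URI("har", null, rawPath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Raw braces: not parseable as a URI.
        System.out.println(isParseableUri("har:///a/b/c/{d.har,e.har}/z/ab/*"));   // false
        // Percent-encoded braces: parseable, as Cheolsoo notes.
        System.out.println(isParseableUri("har:///a/b/c/%7Bd.har,e.har%7D/z/ab/*")); // true
        // How the encoded form in the stack trace arises.
        System.out.println(encodedPath("/a/b/c/{d.har,e.har}/z/ab/*"));
    }
}
```

This only shows that the encoded form is the legal one at the URI level; whether a filesystem implementation then expands the `{a,b}` glob (as the generic FileSystem glob code does) or treats the path as a single literal archive location (as HarFileSystem.initialize appears to) is a separate question, which is where the behavior reported in this thread diverges.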
