It would seem that when there is a wildcard in the last location of the file path, and when using the har file protocol, the combined path count is 0. I get this when trying the example given below.

2012-09-27 09:22:28,074 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-09-27 09:22:28,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-09-27 09:22:28,147 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: x: Store(hdfs://nn/tmp/temp1300843291/tmp-1282091819:org.apache.pig.impl.io.InterStorage) - scope-2 Operator Key: scope-2)
2012-09-27 09:22:28,155 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-09-27 09:22:28,268 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-09-27 09:22:28,280 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-09-27 09:22:30,055 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-09-27 09:22:30,096 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-09-27 09:22:30,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-09-27 09:22:46,428 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *21667*
2012-09-27 09:22:46,431 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *21667*
2012-09-27 09:22:46,440 [Thread-6] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2012-09-27 09:22:46,443 [Thread-6] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 335fea4fecb385745e9a6f2de174a5b26fbc6cae]
2012-09-27 09:24:04,257 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - *Total input paths (combined) to process : 0*
It seems that MapRedUtil returns 0 paths to process after it tries to find proper splits.

On Thu, Sep 27, 2012 at 9:30 AM, Mohnish Kodnani <[email protected]> wrote:

> Any ideas on how I can figure out where things are not working, or is this
> expected behavior?
> New observations:
>
> 1. It seems path globbing does not work with HAR files in Pig. Is this
> intentional? For example:
> hadoop fs -ls har:///x/y/z/{a.har,b.har}/* works and lists all the files
> in both har files. If I give the same path as the input path to a pig
> script, it does not seem to work.
>
> 2. Wildcards in a HAR path.
> As in the above example, the following works on hadoop fs:
> hadoop fs -ls har://x/y/*/a.har/*
> This lists all files from all folders that have a.har.
>
> If I give the same path as the input path to pig, it does not work. I have
> tried these 2 things on pig 0.8.
> Also, for the second use case: if I remove the last wildcard, where the
> files should be, then it works.
> For example, with the input path to pig:
> har://x/y/*/a.har/logFile
>
> pig can read the file and give me records back, but a wildcard in the
> last location does not work.
>
> Any insights would be great on whether this should or should not work. I
> have 30000 files in one folder inside the har; I cannot list each one, and
> I want to use a wildcard as the last element in the path and use path
> globbing to provide multiple har files.
>
> thanks
> mohnish
>
> On Wed, Sep 26, 2012 at 10:44 AM, Mohnish Kodnani <
> [email protected]> wrote:
>
>> I think it's pig related, because if I do hadoop fs -ls on the har file
>> path with input globbing, it works fine.
>>
>> On Tue, Sep 25, 2012 at 7:45 PM, Cheolsoo Park <[email protected]> wrote:
>>
>>> Sounds like I was wrong. ;-)
>>>
>>> You might get a better answer from the hadoop user group, since this is
>>> more related to HarFileSystem than to Pig, I think.
>>>
>>> Thanks,
>>> Cheolsoo
>>>
>>> On Tue, Sep 25, 2012 at 6:20 PM, Mohnish Kodnani
>>> <[email protected]> wrote:
>>>
>>> > Hi Cheolsoo,
>>> > thanks for replying. On the same system the following works:
>>> >
>>> > x = load 'har:///a/b/b/22.har/00/*,har:///a/b/c/d/23.har/00/*' using
>>> > PigStorage('\t');
>>> >
>>> > Two separate file paths with the har protocol work.
>>> >
>>> > A single path works, but if I do the following I get an error:
>>> > x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>>> >
>>> > Thanks
>>> > Mohnish
>>> >
>>> > On Tue, Sep 25, 2012 at 6:09 PM, Cheolsoo Park <[email protected]> wrote:
>>> >
>>> > > Hi Mohnish,
>>> > >
>>> > > I am not very familiar with har files, so I might be wrong here.
>>> > >
>>> > > Looking at the call stack, the exception is thrown from
>>> > > initialize(URI name, Configuration conf) in HarFileSystem.java. In
>>> > > the source code, the comment of this method says the following:
>>> > >
>>> > > > Initialize a Har filesystem per har archive. The
>>> > > > archive home directory is the top level directory
>>> > > > in the filesystem that contains the HAR archive.
>>> > >
>>> > > This sounds to me like HarFileSystem expects a single path.
>>> > >
>>> > > > This gives error due to the curly braces being encoded to %7B and
>>> > > > %7D.
>>> > >
>>> > > The encoded curly braces should be fine, though. In fact, if they're
>>> > > not encoded, that's a problem, because then a URISyntaxException will
>>> > > be thrown by the Java URI class.
>>> > >
>>> > > Hope that this helps,
>>> > > Cheolsoo
>>> > >
>>> > > On Tue, Sep 25, 2012 at 12:43 PM, Mohnish Kodnani <
>>> > > [email protected]> wrote:
>>> > >
>>> > > > Hi,
>>> > > > I am trying to give multiple paths to a pig script using path
>>> > > > globbing with the HAR file format, and it does not seem to work. I
>>> > > > wanted to know if this is expected, or a bug / feature request.
>>> > > >
>>> > > > Command:
>>> > > > x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>>> > > >
>>> > > > This gives an error due to the curly braces being encoded to %7B
>>> > > > and %7D. I am trying this on Pig 0.8.0.
>>> > > >
>>> > > > ERROR 2017: Internal error creating job configuration.
>>> > > >
>>> > > > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias blah
>>> > > >     at org.apache.pig.PigServer.openIterator(PigServer.java:765)
>>> > > >     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
>>> > > >     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
>>> > > >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>>> > > >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>>> > > >     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
>>> > > >     at org.apache.pig.Main.run(Main.java:455)
>>> > > >     at org.apache.pig.Main.main(Main.java:107)
>>> > > > Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias blah
>>> > > >     at org.apache.pig.PigServer.storeEx(PigServer.java:889)
>>> > > >     at org.apache.pig.PigServer.store(PigServer.java:827)
>>> > > >     at org.apache.pig.PigServer.openIterator(PigServer.java:739)
>>> > > >     ... 7 more
>>> > > > Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:679)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
>>> > > >     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
>>> > > >     at org.apache.pig.PigServer.storeEx(PigServer.java:885)
>>> > > >     ... 9 more
>>> > > > Caused by: java.io.IOException: Invalid path for the Har Filesystem.
>>> > > > har:///user/cronusapp/cassini_downsample_logs/prod/2012/09/%7B22.har,23.har%7D/00/*
>>> > > >     at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:100)
>>> > > >     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1563)
>>> > > >     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
>>> > > >     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
>>> > > >     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:348)
>>> > > >     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:317)
>>> > > >     at org.apache.pig.builtin.PigStorage.setLocation(PigStorage.java:219)
>>> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:369)
>>> > > >     ... 14 more
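[Editor's note] Cheolsoo's point about the curly braces can be checked with plain java.net.URI, independent of Hadoop and Pig. A minimal sketch (class and method names are made up for illustration): the single-string URI constructor rejects a raw `{` in the path, while the multi-argument constructor percent-encodes it, which is how `{22.har,23.har}` becomes `%7B22.har,23.har%7D` by the time HarFileSystem.initialize sees and rejects the path.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HarGlobUriDemo {

    // Raw curly braces are illegal in a URI path, so the single-string
    // constructor throws URISyntaxException on an unencoded glob.
    static boolean isParseableUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // The multi-argument constructor quotes illegal characters instead of
    // rejecting them: '{' becomes %7B and '}' becomes %7D, while ',' and
    // '*' are legal path characters and pass through unchanged.
    static String encodedPath(String rawPath) {
        try {
            return new URI("har", null, rawPath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Raw braces: not parseable as a URI.
        System.out.println(isParseableUri("har:///a/b/c/{d.har,e.har}/z/ab/*"));   // false
        // Percent-encoded braces: parseable, as Cheolsoo notes.
        System.out.println(isParseableUri("har:///a/b/c/%7Bd.har,e.har%7D/z/ab/*")); // true
        // How the encoded form in the stack trace arises.
        System.out.println(encodedPath("/a/b/c/{d.har,e.har}/z/ab/*"));
    }
}
```

This only shows that the encoded form is the legal one at the URI level; whether a filesystem implementation then expands the `{a,b}` glob (as the generic FileSystem glob code does) or treats the path as a single literal archive location (as HarFileSystem.initialize appears to) is a separate question, which is where the behavior reported in this thread diverges.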
