On 2013-01-17 23:11, Jakub Glapa wrote:
Hi Jakub,
> my Pig script is going to produce a set of files that will be the input
> for a different process. The script will run periodically, so the number
> of files will keep growing.
> I would like to implement an expiry mechanism where I could remove files
> that are older than x, or where the number of files has reached some
> threshold.
> I know a crazy way where, in a bash script, you call "hadoop fs -ls ...",
> parse the output, and then execute "rmr" on matching entries.
> Is there a "human" way to do this from a Python script? Pig.fs()?
I had the same issue as you a few months ago. The public Pig scripting
API only exposes an FsShell object, which is way too limited to do any
real work. However, it is possible to get access to the Hadoop FileSystem
API from a Python script:
# Jython imports, resolved against the Pig/Hadoop classpath.
from org.apache.pig.scripting import ScriptPigContext
from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
from org.apache.hadoop.fs import FileSystem

def get_fs():
    """Return an org.apache.hadoop.fs.FileSystem instance."""
    # The Pig scripting API exports an FsShell but not a FileSystem object.
    ctx = ScriptPigContext.get()
    props = ctx.getPigContext().getProperties()
    conf = ConfigurationUtil.toConfiguration(props)
    return FileSystem.get(conf)
Once you have a FileSystem object you can do whatever you want using
the standard Hadoop API.
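For the age-based expiry Jakub asked about, a minimal sketch could look like the following. It assumes the script runs inside Pig's Jython interpreter (so the org.apache.* imports resolve against the Hadoop classpath); the names cleanup_old_outputs and is_expired are illustrative, not part of any API. The age check itself is plain Python, so it is kept in its own function:

```python
def is_expired(mtime_ms, now_ms, max_age_ms):
    """Return True if a modification time (ms) is older than max_age_ms."""
    return (now_ms - mtime_ms) > max_age_ms

def cleanup_old_outputs(dir_path, max_age_ms):
    """Delete entries under dir_path older than max_age_ms (Jython only)."""
    # Jython-only imports: these resolve against the Pig/Hadoop classpath,
    # so they live inside the function to keep the rest importable anywhere.
    from org.apache.pig.scripting import ScriptPigContext
    from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
    from org.apache.hadoop.fs import FileSystem, Path
    from java.lang import System

    ctx = ScriptPigContext.get()
    props = ctx.getPigContext().getProperties()
    fs = FileSystem.get(ConfigurationUtil.toConfiguration(props))

    now = System.currentTimeMillis()
    for status in fs.listStatus(Path(dir_path)):
        if is_expired(status.getModificationTime(), now, max_age_ms):
            fs.delete(status.getPath(), True)  # True = recursive delete
```

A count-based threshold would work the same way: sort the FileStatus array by getModificationTime() and delete the oldest entries beyond the limit.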
Hope this helps.
-- Clément