Eugene,
  As I said earlier, you can use a different dfs.umaskmode. Running pig
with -Ddfs.umaskmode=022 will give read access to everyone (755 instead of
700). But note that every file the pig script outputs will then have those
permissions.
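
For example (yourscript.pig here just stands for your own script):

    pig -Ddfs.umaskmode=022 yourscript.pig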

A better approach would be to write the serialized file in the step below
with more accessible permissions.
2. After that, the client side builds the filter, serializes it, and moves
it to the server side.
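
For example, when writing that file, something like this (a rough sketch;
the method name and the target path are just placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    // write the serialized filter and loosen its permissions so the
    // region server user (hbase) can read it
    public static void writeFilterFile(Configuration conf, byte[] filterBytes,
                                       Path target) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream out = fs.create(target);
      try {
        out.write(filterBytes);
      } finally {
        out.close();
      }
      fs.setPermission(target, new FsPermission((short) 0644));   // rw-r--r--
    }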

Regards,
Rohini


On Tue, Feb 19, 2013 at 4:26 AM, Eugene Morozov
<[email protected]>wrote:

> Rohini,
>
> Sorry for the confusion about the users in my previous e-mails. Here is a
> more thorough explanation of my issue.
>
> This is what I got when I tried to run it.
>
> The file has been successfully copied using "tmpfiles".
> 2013-02-08 13:38:56,533 INFO
> org.apache.hadoop.hbase.filter.PrefixFuzzyRowFilterWithFile: File
>
> [/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/vagrant/.staging/job_201302081322_0001/files/pairs-tmp#pairs-tmp]
> has been found
> 2013-02-08 13:38:56,539 ERROR
> org.apache.hadoop.hbase.filter.PrefixFuzzyRowFilterWithFile: Cannot read
> file:
>
> [/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/vagrant/.staging/job_201302081322_0001/files/pairs-tmp#pairs-tmp]
> org.apache.hadoop.security.AccessControlException: Permission denied:
> user=hbase, access=EXECUTE,
>
> inode="/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/vagrant/.staging":vagrant:supergroup:drwx------
>
>
> org.apache.hadoop.hbase.filter.PrefixFuzzyRowFilterWithFile is my own
> filter; it just lives in the org.apache... package.
>
> 1. I have a user, vagrant, and this user runs the pig script.
> 2. After that, the client side builds the filter, serializes it, and moves
> it to the server side.
> 3. The RegionServer comes into play here: it deserializes the filter and
> tries to use it while reading the table.
> 4. The filter in turn tries to read the file, but since the RegionServer
> runs under the system user "hbase", the filter acts with that user's
> credentials and cannot access a file written by another user (see the
> sketch below).
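>
> Roughly, the read inside the filter boils down to something like this (a
> simplified illustration, not the exact code; names are placeholders):
>
>   // what the filter does on the region server side, roughly
>   // (imports: java.io.*, java.util.*, org.apache.hadoop.conf.Configuration,
>   //  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path)
>   private List<String> readValues(Configuration conf, String file) throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     List<String> values = new ArrayList<String>();
>     BufferedReader reader = new BufferedReader(
>         new InputStreamReader(fs.open(new Path(file))));   // AccessControlException is thrown here
>     try {
>       String line;
>       while ((line = reader.readLine()) != null) {
>         values.add(line);
>       }
>     } finally {
>       reader.close();
>     }
>     return values;
>   }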
>
> Any ideas of what to try?
>
> On Sun, Feb 17, 2013 at 8:22 AM, Rohini Palaniswamy <
> [email protected]
> > wrote:
>
> > Hi Eugene,
> >       Sorry. Missed your reply earlier.
> >
> >     tmpfiles has been around for a while and will not be removed from
> > hadoop anytime soon, so don't worry about it. The hadoop configurations
> > have never been fully documented, and people look at the code and use
> > them. Things are usually deprecated for years before being removed.
> >
> >   The file will be created with permissions based on the dfs.umaskmode
> > setting (or fs.permissions.umask-mode in Hadoop 0.23/2.x), and the owner
> > of the file will be the user who runs the pig script. The map job will be
> > launched as that same user by the pig script. I don't understand what you
> > mean by the user that runs the map task not having permissions. What kind
> > of hadoop authentication are you doing such that the file is created as
> > one user and the map job is launched as another?
> >
> > Regards,
> > Rohini
> >
> >
> > On Sun, Feb 10, 2013 at 10:26 PM, Eugene Morozov
> > <[email protected]>wrote:
> >
> > > Hi, again.
> > >
> > > I've been able to successfully use the trick with DistributedCache and
> > > "tmpfiles": during the run of my Pig script the files are copied by the
> > > JobClient to the job cache.
> > >
> > > But here is the issue. The files are there, but they have permission
> > > 700, and the user that runs the map task (I suppose it's hbase) doesn't
> > > have permission to read them. The files are owned by my current OS user.
> > >
> > > First, this looks like a bug, doesn't it?
> > > Second, what can I do about it?
> > >
> > >
> > > On Thu, Feb 7, 2013 at 11:42 AM, Eugene Morozov
> > > <[email protected]>wrote:
> > >
> > > > Rohini,
> > > >
> > > > thank you for the reply.
> > > >
> > > > Isn't it kind of a hack to use "tmpfiles"? It's neither a public API
> > > > nor a well-known practice; it's an internal detail. How safe is it to
> > > > rely on such a trick? I mean, after a month or so we will probably
> > > > update our CDH4 to whatever is current. Will it still work? Will it be
> > > > safe for the cluster or for my job? Who knows what will be implemented
> > > > there?
> > > >
> > > > You see, I can understand the code and find such a solution, but I
> > > > won't be able to keep all of these details in mind to check when we
> > > > update the cluster.
> > > >
> > > >
> > > > On Thu, Feb 7, 2013 at 1:23 AM, Rohini Palaniswamy <
> > > > [email protected]> wrote:
> > > >
> > > >> You should be fine using tmpfiles and that's the way to do it.
> > > >>
> > > >>  Else you will have to copy the file to hdfs and call
> > > >> DistributedCache.addFileToClassPath yourself (which is basically what
> > > >> the tmpfiles setting does). But the problem there, as you mentioned,
> > > >> is cleaning up the hdfs file after the job completes. If you use
> > > >> tmpfiles, the file is copied to the job's staging directory in the
> > > >> user's home directory and gets cleaned up automatically when the job
> > > >> completes. If the file is not going to change between jobs, I would
> > > >> advise creating it in hdfs once in a fixed location and reusing it
> > > >> across jobs, calling only DistributedCache.addFileToClassPath(). But
> > > >> if it is dynamic and differs from job to job, tmpfiles is your choice.
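> > > >>
> > > >> For the fixed-location case, a rough sketch (paths are just examples):
> > > >>
> > > >>   // imports: org.apache.hadoop.conf.Configuration,
> > > >>   //          org.apache.hadoop.filecache.DistributedCache,
> > > >>   //          org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path
> > > >>   // stage the values file once in hdfs and reuse it for every job
> > > >>   public static void addValuesFile(Configuration jobConf) throws IOException {
> > > >>     FileSystem fs = FileSystem.get(jobConf);
> > > >>     Path fixed = new Path("/user/yourname/cache/filter-values");   // example path
> > > >>     if (!fs.exists(fixed)) {
> > > >>       fs.copyFromLocalFile(new Path("/tmp/filter-values"), fixed); // example local file
> > > >>     }
> > > >>     DistributedCache.addFileToClassPath(fixed, jobConf);
> > > >>   }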
> > > >>
> > > >> Regards,
> > > >> Rohini
> > > >>
> > > >>
> > > >> On Mon, Feb 4, 2013 at 1:26 PM, Eugene Morozov <
> > > [email protected]
> > > >> > wrote:
> > > >>
> > > >> > Hello, folks!
> > > >> >
> > > >> > I'm using a heavily customized HBaseStorage in my pig script.
> > > >> > During HBaseStorage.setLocation() I prepare a file with values that
> > > >> > will be the source for my filter. The filter is used during
> > > >> > HBaseStorage.getNext().
> > > >> >
> > > >> > Since a Pig script is basically an MR job with many mappers, my
> > > >> > values file must be accessible to all my map tasks. There is the
> > > >> > DistributedCache, which should copy files across the cluster so that
> > > >> > they are local to every map task. I don't want to write my file to
> > > >> > HDFS in the first place, because there is no way to clean it up
> > > >> > after the MR job is done (maybe you can point me in the right
> > > >> > direction). On the other hand, if I write the file to the local file
> > > >> > system "/tmp", then I can either call deleteOnExit() or just forget
> > > >> > about it; linux will take care of its local "/tmp".
> > > >> >
> > > >> > But here is a small problem. DistributedCache copies files only if
> > > >> > it is used with a command line parameter like "-files". In that case
> > > >> > GenericOptionsParser copies all the files, but the DistributedCache
> > > >> > API itself only lets you specify parameters in the jobConf; it
> > > >> > doesn't actually do the copying.
> > > >> >
> > > >> > I've found that GenericOptionsParser sets the property "tmpfiles",
> > > >> > which is used by the JobClient to copy files before it runs the MR
> > > >> > job. And I've been able to set the same property in the jobConf from
> > > >> > my HBaseStorage. It does the trick, but it's a hack.
> > > >> > Is there any other correct way to achieve the goal?
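> > > >> >
> > > >> > For reference, the hack boils down to something like this (simplified;
> > > >> > the local path is just an example):
> > > >> >
> > > >> >   // inside my customized HBaseStorage.setLocation(String, Job), simplified
> > > >> >   // imports: java.io.File, org.apache.hadoop.conf.Configuration
> > > >> >   Configuration conf = job.getConfiguration();
> > > >> >   File values = new File("/tmp/pairs-tmp");        // example local file
> > > >> >   // ... write the filter values into 'values' ...
> > > >> >   // the same property GenericOptionsParser sets when you pass -files:
> > > >> >   conf.set("tmpfiles", values.toURI().toString());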
> > > >> >
> > > >> > Thanks in advance.
> > > >> > --
> > > >> > Evgeny Morozov
> > > >> > Developer Grid Dynamics
> > > >> > Skype: morozov.evgeny
> > > >> > www.griddynamics.com
> > > >> > [email protected]
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Evgeny Morozov
> > > > Developer Grid Dynamics
> > > > Skype: morozov.evgeny
> > > > www.griddynamics.com
> > > > [email protected]
> > > >
> > >
> > >
> > >
> > > --
> > > Evgeny Morozov
> > > Developer Grid Dynamics
> > > Skype: morozov.evgeny
> > > www.griddynamics.com
> > > [email protected]
> > >
> >
>
>
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [email protected]
>
